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ABSTRACT 


Generalizations  of  the  Alternating  Direction  Method  of  Multipliers  for  Large-Scale 

and  Distributed  Optimization 


by 


Wei  Deng 


The  alternating  direction  method  of  multipliers  (ADMM)  has  been  revived  in 
recent  years  due  to  its  effectiveness  at  solving  many  large-scale  and  distributed  opti¬ 
mization  problems,  particularly  arising  from  the  areas  of  compressive  sensing,  signal 
and  image  processing,  machine  learning  and  statistics.  This  thesis  makes  important 
generalizations  to  ADMM  as  well  as  extending  its  convergence  theory. 

We  propose  a  generalized  ADMM  framework  that  allows  more  options  of  solving 
the  subproblems,  either  exactly  or  approximately.  Such  generalization  is  of  great 
practical  importance  because  it  brings  more  flexibility  in  solving  the  subproblems 
efficiently,  thereby  making  the  entire  algorithm  run  faster  and  scale  better  for  large 
problems.  We  establish  its  global  convergence  and  further  show  its  linear  convergence 
under  a  variety  of  scenarios,  which  cover  a  wide  range  of  applications.  The  derived  rate 
of  convergence  also  provides  some  theoretical  guidance  for  optimizing  the  parameters 
of  the  algorithm.  In  addition,  we  introduce  a  simple  technique  to  improve  an  existing 
convergence  rate  from  0(l/k)  to  o(l/k). 

Moreover,  we  introduce  a  parallel  and  multi-block  extension  to  ADMM  for  solving 
convex  separable  problems  with  multiple  blocks  of  variables.  The  algorithm  decom¬ 
poses  the  original  problem  into  smaller  subproblems  and  solves  them  in  parallel  at 


each  iteration.  It  can  be  implemented  in  a  fnlly  distributed  manner  and  is  particularly 
attractive  for  solving  certain  large-scale  problems.  We  show  that  extending  ADMM 
straightforwardly  from  the  classic  Gauss-Seidel  setting  to  the  Jacobi  setting,  from 
two  blocks  to  multiple  blocks,  will  preserve  convergence  if  the  constraint  coefficient 
matrices  are  mutually  near-orthogonal  and  have  full  column-rank.  For  general  cases, 
we  propose  to  add  proximal  terms  of  different  kinds  to  the  subproblems,  so  that  they 
can  be  solved  in  flexible  and  efficient  ways  and  the  algorithm  converges  globally  at 
a  rate  of  o(l/k).  We  introduce  a  strategy  for  dynamically  tuning  the  parameters  of 
the  algorithm,  often  leading  to  faster  convergence  in  practice.  Numerical  results  are 
presented  to  demonstrate  the  efficiency  of  the  proposed  algorithm  in  comparison  with 
several  existing  parallel  algorithms.  We  also  implemented  our  algorithm  on  Amazon 
EC2,  an  on-demand  public  computing  cloud,  and  report  its  performance  on  very 
large-scale  basis  pursuit  problems  with  distributed  data. 
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Chapter  1 
Introduction 

With  the  rapid  advancement  of  modern  technology,  enormous  amount  of  data  is  be¬ 
ing  generated  in  diverse  areas  such  as  the  Internet,  mobile  devices,  cameras,  business, 
research  and  engineering.  There  is  a  dramatically  increasing  demand  for  processing 
and  interpreting  the  often  extremely  large  datasets.  However,  the  size  and  complexity 
of  the  modern  datasets  have  been  growing  beyond  the  ability  of  the  traditional  meth¬ 
ods,  and  have  become  a  major  bottleneck  for  many  applications.  This  phenomenon 
is  often  referred  to  as  the  “Big  Data”.  Therefore,  efficient  and  scalable  methods  are 
highly  desirable  to  cope  with  the  size  and  complexity  of  the  modern  datasets. 

This  thesis  focuses  on  a  particular  optimization  method  -  the  alternating  direction 
method  of  multipliers  (ADMM),  which  is  well  suited  to  many  of  the  Big  Data  appli¬ 
cations.  ADMM  uses  a  “divide  and  conquer”  strategy  to  decompose  an  often  large 
and  difficult  problem  into  smaller  and  easier  subproblems,  which  can  be  processed 
efficiently  and  in  parallel.  The  method  can  be  implemented  in  a  fully  decentralized 
manner  to  process  distributed  and  huge  datasets.  Due  to  these  favorable  features,  this 
classic  optimization  method  has  experienced  a  renaissance  in  the  recent  decade.  As 
a  versatile  algorithmic  tool,  ADMM  has  proven  to  be  very  effective  at  solving  many 
large-scale  and  distributed  optimization  problems,  particularly  arising  from  the  areas 
of  compressive  sensing,  signal  and  image  processing,  machine  learning  and  applied 
statistics. 

In  this  thesis,  we  generalize  the  classic  ADMM  in  various  ways,  aiming  to  further 
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improve  its  efficiency,  flexibility  and  applicability  for  large-scale  and  distributed  op¬ 
timization  problems.  We  also  make  important  extensions  to  the  convergence  theory 
and  establish  better  convergence  rates.  The  structure  of  this  thesis  is  organized  as 
follows: 

•  In  Chapter  2,  we  briefly  review  ADMM  and  the  related  literature.  We  give 
an  overview  of  several  important  applications  which  motivate  the  study  of  this 
thesis.  In  addition,  we  introduce  some  basic  mathematical  notation  and  pre¬ 
liminary  knowledge  in  convex  analysis  that  will  be  used  frequently  in  the  later 
chapters. 

•  In  Chapter  3,  we  introduce  a  generalized  ADMM  framework  that  allows  more 
options  of  solving  the  subproblems  either  exactly  or  approximately,  such  as 
linearizing  the  subproblems,  taking  one  gradient  descent  step,  and  approximat¬ 
ing  the  Hessian.  The  generalization  is  of  great  practical  importance  because  it 
brings  more  flexibility  in  solving  the  subproblems  efficiently,  thereby  making  the 
entire  algorithm  run  faster  and  scale  better  for  large  problems.  Furthermore,  we 
establish  the  global  convergence  of  the  generalized  ADMM  under  some  standard 
assumptions.  In  addition,  we  give  a  brief  literature  review  on  the  convergence 
rates  of  ADMM.  We  introduce  a  simple  technique  to  improve  an  existing  0{l/k) 
convergence  rate  to  o(l/k),  where  k  is  the  number  of  iterations. 

•  In  Chapter  4,  we  mainly  establish  the  global  linear  convergence  of  the  gen¬ 
eralized  ADMM  under  a  variety  of  scenarios.  Among  these  scenarios,  we  re¬ 
quire  that  at  least  one  of  the  two  objective  functions  is  strictly  convex  and  has 
Lipschitz  continuous  gradient,  along  with  certain  full  rank  conditions  on  the 
constraint  coefficient  matrices.  These  scenarios  commonly  arise  in  many  ap- 
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plications  and  we  will  give  several  examples.  In  addition,  we  carry  out  simple 
numerical  experiments  to  demonstrate  the  linear  convergence  of  AD  MM  on  the 
elastic  net  and  distributed  Lasso  problems. 

•  In  Chapter  5,  we  introduce  a  parallel  and  multi-block  extension  to  ADMM, 
which  is  well  suited  to  distributed  computing  and  is  particularly  attractive  for 
solving  certain  large-scale  problems.  We  show  that  extending  ADMM  straight¬ 
forwardly  from  the  classic  Gauss-Seidel  setting  to  the  Jacobi  setting,  from  2 
blocks  to  N  blocks,  will  preserve  convergence  if  the  constraint  coefficient  ma¬ 
trices  are  mutually  near-orthogonal  and  have  full  column-rank.  Without  these 
assumptions,  this  straightforward  extension  may  fail  to  converge  in  general. 
Therefore,  we  propose  a  simple  modification  by  adding  proximal  terms  of  dif¬ 
ferent  kinds  to  the  subproblems.  We  show  that  this  modification  not  only  allows 
more  flexible  and  efficient  ways  of  solving  the  subproblems,  but  more  impor¬ 
tantly  makes  the  algorithm  converges  globally  at  a  rate  of  o{l/k).  In  addition, 
we  introduce  a  strategy  for  dynamically  tuning  the  parameters  of  the  algorithm, 
often  leading  to  faster  convergence  in  practice.  We  present  numerical  results  to 
demonstrate  the  efficiency  of  the  proposed  algorithm  in  comparison  with  several 
existing  parallel  algorithms.  We  also  implemented  our  algorithm  on  Amazon 
EC2,  an  on-demand  public  computing  cloud,  and  report  its  performance  on 
very  large-scale  basis  pursuit  problems  with  distributed  data. 


•  In  Chapter  6,  we  summarize  and  conclude  the  thesis. 
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Chapter  2 


Background  and  Preliminary 


In  this  chapter,  we  provide  some  background  and  preliminary  knowledge  necessary 
to  understand  the  later  chapters  of  the  thesis.  First,  in  Section  2.1,  we  give  a  brief 
introduction  to  the  augmented  Lagrangian  method,  a  precursor  to  the  alternating 
direction  method  of  multipliers.  Then  in  Section  2.2,  we  review  the  alternating  di¬ 
rection  method  of  multipliers  and  the  related  literature.  In  Section  2.3,  we  mention 
several  important  applications  which  motivate  the  study  of  this  thesis.  In  Section 
2.4,  we  introduce  some  basic  mathematical  notation  and  preliminary  knowledge  in 
convex  analysis  that  will  be  used  frequently  in  the  later  chapters. 


2.1  Augmented  Lagrangian  Method 

The  classic  augmented  Lagrangian  method ,  also  known  as  the  method  of  multipliers , 
was  first  introduced  by  [37,49]  in  the  late  1960s  and  has  been  a  popular  methodology 
for  solving  constrained  optimization  problems.  Let  us  consider  the  canonical  convex 
optimization  problem  with  linear  constraints: 


min  fix) 
s.t.  Ax  =  6, 


(2.1) 


where  x  G  Mn  is  an  unknown  variable,  A  G  M.pxn  and  b  G  ML  are  given,  and  /  : 
ln  G  KU  {+oo}  is  an  extended-value  convex  function.  Note  that  /  is  allowed  to 
take  the  value  +oo  so  that  any  constraints  on  x  can  be  “embedded”  in  /.  That  is,  if 
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x  is  constrained  in  some  closed  convex  set  X  C  Mn,  then  it  is  equivalent  to  include 
the  indicator  functions  Ix(x)  as  part  of  the  objective  function  /,  where  the  indicator 
function  of  a  convex  set  X  is  defined  by 

{0  if  x  G  X, 

(2.2) 

■boo  if  x  X . 

The  so-called  augmented  Lagrangian  of  (2.1)  is  given  by 

A)  :=  f(x)  -  A T(Ax  -b)  +  ^|| Ax  -  b\\l,  (2.3) 

where  A  G  is  the  Lagrangian  multiplier  (or  dual  variable),  and  fd  >  0  is  a  penalty 
parameter.  It  can  be  viewed  as  a  combination  of  the  standard  Lagrangian  function 
and  a  quadratic  penalty  term  on  the  constraints. 

The  augmented  Lagrangian  method  is  an  iterative  algorithmic  framework  that 
solves  a  sequence  of  unconstrained  subproblems  based  on  the  augmented  Lagrangian 
function.  It  starts  with  an  initial  estimate  A0  to  the  Lagrangian  multiplier  and  iterates 
as  follows:  for  k  —  0, 1, . . ., 


k+1  =  argmin  Cp{x,  Xk) 

X 

(2.4) 

k+1  =  \k  -  /3(Axk+1  -  b) 

(2.5) 

At  each  iteration,  it  minimizes  the  augmented  Lagrangian  over  the  primal  variable  x 
by  fixing  the  dual  variable  A,  and  then  makes  a  simple  update  to  A  using  the  updated 
x.  In  addition,  the  fixed  penalty  parameter  (3  can  also  be  replaced  by  a  sequence  of 
variable  parameters  {/3k}  updated  at  every  iteration. 

The  augmented  Lagrangian  method  is  closely  related  to  the  dual  ascent  method  (or 
dual  subgradient  method)  [53]  as  well  as  the  quadratic  penalty  method.  Compared  to 
these  methods,  the  augmented  Lagrangian  method  is  more  robust  and  converges  under 
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much  milder  conditions.  Unlike  the  quadratic  penalty  method  where  the  penalty  pa¬ 
rameter  needs  to  go  to  infinity,  the  penalty  parameter  /3  of  the  augmented  Lagrangian 
method  can  be  fixed  and  stay  much  smaller,  thereby  avoiding  ill-conditioning  of  the 
subproblems. 

2.2  Alternating  Direction  Method  of  Multipliers 

The  alternating  direction  method  of  multipliers  (ADMM)  is  a  variant  of  the  aug¬ 
mented  Lagrangian  method,  which  is  applied  to  constrained  optimization  problems 
with  separable  objective  functions  in  the  following  form: 


min  /( x)  +  g(y) 


(2.6) 


s.t.  Ax  +  By  =  b, 


where  x  G  Mn  and  y  G  Mm  are  unknown  variables,  A  G  Rpxn  and  B  G  Mpxm  are  given 
matrices,  &  G  ip  is  a  given  vector,  and  /  :  R"  G  RU  {+00}  and  g  :  Mm  Gill  {+00} 
are  extended-value  convex  functions.  Constraints  x  G  X  and  y  £  y,  where  X  C  Mn 
and  y  C  Mm  are  closed  convex  sets,  can  be  included  as  the  indicator  functions  Ix(x) 
and  h  (y)  in  the  objective  functions  /  and  g,  respectively. 

Applying  the  augmented  Lagrangian  method  to  (2.6)  yields  the  following  itera¬ 
tions: 


(2.7) 


\k+ 1  =  A*  -  0(Axk+1  +  Byk+1  -  b). 


(2.8) 


Here,  the  augmented  Lagrangian  of  (2.6)  is  given  by: 


(2.9) 
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where  A  e  Mp  is  the  Lagrangian  multiplier  and  /3  >  0  is  a  penalty  parameter.  Note 
that  though  x  and  y  are  separable  in  the  objective  functions,  they  are  coupled  in  the 
augmented  Lagrangian  through  the  quadratic  penalty  term. 

In  many  applications,  it  is  often  relatively  easy  to  minimize  the  functions  /  and 
g  individually;  however,  minimizing  both  of  /  and  g  at  the  same  time  is  much  more 
difficult.  Therefore,  the  joint  minimization  step  (2.7)  of  the  augmented  Lagrangian 
method  often  becomes  a  bottleneck  for  these  applications.  In  contrast,  the  alternating 
direction  method  of  multipliers  (see  Algorithm  1  below)  replaces  the  joint  minimiza¬ 
tion  step  by  two  “alternating  minimization”  steps  in  which  Cp  ( x ,  y ,  Xk)  in  (2.7)  is 
minimized  over  x  and  y  separately,  one  after  another.  We  refer  to  the  step  3  and  step 
4  of  Algorithm  1  as  ?/-subproblem  and  x-subproblem  of  ADMM,  respectively. 

Algorithm  1:  Classic  ADMM 


1  Initialize  x°,  A0,  f3  >  0; 

2  for  k  —  0,  1, ...  do 
yk+i  _  arg  nijn  £p(xk,  y,  Afc); 

xk+1  =  arguing.  £p(x,  yk+1,  Afc); 

Xk+i  =  Xk  _  p(Axk+ 1  +  Byk+ 1  _  })y 


Compared  to  the  augmented  Lagrangian  method,  ADMM  can  take  advantage  of 
the  separability  of  the  problem  (2.6)  and  thus  is  particularly  more  effective  on  this 
kind  of  problems.  ADMM  may  take  more  iterations  to  converge,  due  to  less  accurate 
minimization  of  the  augmented  Lagrangian  Cy(x,y,  Xk)  than  (2.7)  at  each  iteration. 
However,  ADMM  often  runs  faster  due  to  the  much  cheaper  subproblems. 

ADMM  was  first  introduced  in  the  1970s  by  [21,23]  with  the  application  in  solving 
partial  differential  equations.  Its  root  traces  back  to  the  Douglas-Rachford  splitting 


method  [15]  in  the  1950s.  However,  this  classic  optimization  method  was  not  widely 
known  until  the  recent  past.  In  the  last  few  years,  ADMM  has  experienced  a  sig¬ 
nificant  revival  and  has  found  numerous  successful  applications,  particularly  in  the 
areas  of  compressive  sensing,  signal  and  image  processing,  machine  learning,  applied 
statistics  and  operations  research.  We  refer  to  [4,6,14,18,27,29,39,42,59,61]  for  a 
number  of  recent  applications.  With  the  relative  ease  of  implementation,  ADMM  of¬ 
ten  leads  to  state-of-the-art  performance  on  many  large-scale  problems.  Also,  ADMM 
is  well  suited  to  parallel  and  distributed  computing.  It  can  be  implemented  in  a  fully 
decentralized  computational  environment,  with  the  scalability  to  process  very  huge 
datasets.  These  benefits  have  mainly  contributed  to  the  renaissance  of  ADMM  and 
are  of  great  importance  to  modern  applications  with  extremely  large  datasets. 

2.3  Applications 

The  alternating  direction  method  of  multipliers  has  wide  applications  in  diverse  areas. 
In  this  section,  we  review  several  important  applications  as  motivating  examples. 

2.3.1  Convex  Regularization 

Convex  regularization  models  have  been  widely  used  in  various  applications: 

min  f(Ax  —  b)  +  g(x)  (2.10) 

X 

where  x  £  Mn  is  unknown  variable,  A  £  Mpxn  and  b  £  Rp  are  given  data,  and  /  and 
g  are  convex  functions.  Here,  /  is  referred  to  as  the  loss  (or  data  fidelity )  function 
for  penalizing  data-htting  errors;  g  is  the  regularization  function  in  order  to  prevent 
over-fitting  or  promote  certain  structures  in  the  solutions.  For  example,  the  loss 
function  /  can  be  chosen  as  squared  t^-norm  ||  •  |||  (least  squares),  f'l-norm  ||  •  1^  (least 
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absolute  deviations),  Huber  function,  indicator  function  of  a  closed  convex  set,  and  so 
on.  Several  commonly  used  regularization  functions  g  include  squared  £2-norm  ||  ■  ||§, 
total  variation,  tj-norm,  nuclear  norm,  just  to  name  a  few. 

Though  the  problem  (2.10)  is  not  in  the  form  of  (2.6),  it  can  be  transformed  to 
(2.6)  after  introducing  auxiliary  variables  and  constraints.  For  example,  introducing 
y  =  x,  the  problem  (2.10)  becomes 

min  f(Ax  —  b)  +  g(y ) 

x’y  (2.11) 


s.t.  x  —  y  =  0. 


Alternatively,  one  may  also  introduce  y  =  Ax  —  b  to  reformulate  (2.10)  as 

min  g(x)  +  f(y) 

x,y 

s.t.  Ax  —  y  =  b. 


(2.12) 


Then  ADMM  is  readily  applicable  to  either  (2.11)  or  (2.12).  The  way  of  reformulating 
the  problems  should  be  chosen  wisely  so  that  the  resulting  subproblems  can  be  carried 
out  efficiently. 


2.3.2  Sparse  and  Low- Rank  Optimization 

In  recent  years,  the  problems  of  finding  sparse  or/and  low-rank  solutions  have  received 
tremendous  attention  from  researchers  and  engineers,  particularly  those  in  the  areas 
of  compressive  sensing,  signal  and  image  processing,  machine  learning  and  statistics. 
Many  of  these  problems  are  posed  as  convex  regularization  models,  where  the  A -norm 
or  nuclear  norm  are  often  used  as  the  regularization  function  to  enforce  sparsity  or 
low-rankness  in  the  solutions.  For  example,  we  consider  the  following  problems: 

•  Sparse  recovery  [11]: 

min  ||a:||i  H - \\Ax  —  b\\l, 

11  11  2/j,  ll2’ 


(2.13) 
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where  A  G  Mmxn,  b  G  Mm,  and  /i  >  0  is  a  regularization  parameter.  Here  the  £\ 
norm  ||a;||i  :=  ^-!=1  |ay|  is  known  to  promote  sparse  solutions.  In  compressive 
sensing,  this  problem  is  commonly  used  for  reconstructing  sparse  signals  from 
a  relatively  small  number  of  (noisy)  measurements.  In  machine  learning  and 
statistics,  it  is  also  known  as  the  Lasso  problem  [56]  for  feature  selection. 

•  Low-rank  matrix  recovery /matrix  completion  [7]: 

nun  ||X||.  +  -  &II2,  (2-14) 

where  X  G  Mmxri  js  an  unknown  low-rank  matrix,  A  :  Mmxn  — >  ML  is  a  linear 
operator,  b  G  Mp  is  our  observed  data,  and  /a  >  0  is  a  regularization  parameter. 
Here,  the  nuclear  norm  ||X||*  denotes  the  sum  of  singular  values  of  X  which  is 
well  known  for  promoting  low-rank  solutions. 

Such  problems  pose  computational  challenges  due  to  the  non-smoothness  of  the 
regularization  functions  as  well  as  the  often  large  scale  of  the  problems.  Traditional 
optimization  methods,  such  as  the  interior-point  methods,  often  cannot  take  advan¬ 
tage  of  the  structures  of  the  problems  and  are  computationally  expensive  to  handle 
very  large  problems. 

We  illustrate  how  ADMM  can  be  effectively  applied  to  these  problems,  giving  rise 
to  very  efficient  and  scalable  algorithms.  Let  us  take  the  sparse  recovery  problem 
(2.13)  as  an  illustrative  example.  By  introducing  the  auxiliary  variable  y  =  x,  (2.13) 
can  be  reformulated  as: 

min  Hj/II!  +  ^-\\ Ax  -  b\\\ 
x,y  1/1 


s.t.  x  —  y  =  0. 


(2.15) 
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Applying  AD  MM  to  the  above  problem  yields  the  following  iterations: 

yk+l  =  arg min  ||j/||i  +  ^  || y-  ( xk  -  Afc//3)  ,  (2.16) 

it+1  =  argmin  T||  Ax  -  bg  +  §  ||x  -  yh+'  -  A*/j9||“ ,  (2.17) 

Afc+1  =  Xk  —  f3(xk+1  —  yk+1).  (2.18) 

Note  that  the  y-subproblem  (2.16)  admits  a  simple  closed-form  solution  by  the  well- 
known  shrinkage  or  soft-thresholding  operator.  For  the  low-rank  matrix  recovery 
problem  (2.14),  the  application  of  ADMM  follows  exactly  the  same,  except  that  the 
closed-form  solution  of  the  Y -subproblems  takes  a  slightly  different  form  by  the  so- 
called  singular  value  soft-thresholding. 

The  ic-subproblem  (2.17)  solves  a  quadratic  problem  that  involves  inverting  a 
matrix  (ft I  +  ATA/y )  or  solving  a  linear  system.  We  may  cache  an  initial  matrix 
factorization  to  make  the  x-subproblems  much  cheaper  to  carry  out.  For  very  large 
problems  where  caching  matrix  factorization  may  not  be  affordable,  it  is  more  efficient 
to  just  solve  the  x-subproblem  approximately,  such  as  taking  one  gradient  descent  step 
as  suggested  in  [61].  Though  more  iterations  may  be  needed  due  to  the  less  accurate 
subproblem,  the  entire  algorithm  runs  faster  because  the  subproblems  can  be  carried 
out  in  much  less  time.  This  observation  motivates  us  to  generalize  the  ADMM  to 
allow  more  options  of  solving  the  subproblems,  possibly  less  exactly  but  faster,  in 
order  to  make  ADMM  more  efficient  with  wider  applicability.  We  will  discuss  such 
generalization  to  ADMM  and  analyze  its  convergence  in  Chapter  3  and  Chapter  4. 
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2.3.3  Consensus  and  Sharing  Optimization 


Consider  in  a  network  of  N  nodes,  the  problem  of  minimizing  the  sum  of  N  functions, 
one  from  each  node,  over  a  common  variable  x.  This  problem  can  be  written  as 


N 


min  > 


(2.19) 


2—1 


Let  each  node  i  keep  vector  x,  G  Mn  as  its  copy  of  x.  To  reach  a  consensus  among  ay, 
i  —  1, . . . ,  N,  a  common  approach  is  to  introduce  a  global  common  variable  y  and  get 

N 


min  V]  fi(xi),  s.t.  xt  -  y  =  0,  i  =  l,...,N. 

{xi},y  “ 

2—1 


(2.20) 


This  is  the  well-known  global  consensus  problem;  see  [4]  for  a  review.  With  an 
objective  function  g  on  the  global  variable  y,  we  have  the  global  variable  consensus 
problem  with  regularization: 

N 


min  V  fi(xi)  +  g(y),  s.t.  xt-y  =  0,  i  =  1, . . . ,  N, 

1.7/  ^  ^ 


2—  1 


(2.21) 


where  g(y)  is  a  convex  function, 

The  following  sharing  problem  is  also  nicely  reviewed  in  [4]: 


N 


N 


min  ^2  fi(xi)  +  g  ^  yt  ,  s.t.  xi-yi  =  0,  i  =  1, . . . ,  N,  (2.22) 

j  iV 


2—  1 


,  2—  1 


where  /,’s  are  local  cost  functions  and  g  is  the  shared  cost  function  by  all  the  nodes 

i. 


ADMM  applied  to  the  problems  (2.20),  (2.21)  and  (2.22)  is  particularly  suitable 
for  distributed  implementation,  since  the  x-subproblem  can  be  decomposed  into  N 
independent  Xj-subproblems,  and  the  update  to  the  multiplier  A  can  also  be  done  at 
each  node  i.  We  will  discuss  this  topic  in  more  details  in  Chapter  5. 
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2.4  Notation  and  Preliminary 


We  let  (-,  •)  denote  the  standard  inner  product  in  the  Euclidean  space.  That  is,  for 
any  x  =  (xi,  x2,  ■  ■  ■ ,  xn)T  G  Mn  and  y  =  (yu  y2, . . . ,  yn)T  e  Mn, 

n 

(x,  y)  :=  xTy  =  ^2  xiVi-  (2-23) 

i=l 

We  use  ||  •  ||  to  denote  the  t2-norm.  That  is,  for  a  vector  x  =  (xi,  x2, . . . ,  xn)T  G  Mn, 


For  a  matrix  M  G  Mmxn, 
value  of  M : 


x 


:=  a/ xtx  = 


x? 


(2.24) 


|| Af ||  denotes  the  spectral  norm,  i.e.,  the  largest  singular 


||M||  :=  v^pTM). 


(2.25) 


We  let  Amin(-)  and  Amax(-)  denote  the  smallest  and  largest  eigenvalues  of  a  real  sym¬ 
metric  matrix,  respectively. 

For  a  positive  definite  matrix  G  G  Mnxn,  we  define  the  G-norm  as  follows: 


x||g  :=  VxTGx,  Vx  G  Mn. 


(2.26) 


If  the  matrix  G  is  positive  semi-definite,  then  ||  •  ||c>  is  a  semi-norm. 

Let  us  consider  an  extended  real- value  function  /  :  Mn  — *  MU{+oo},  and  let  dom / 
denote  its  domain.  We  say  the  function  is  proper  if  there  is  at  least  one  x  G  dom / 
such  that  /(x)  <  +oo.  We  say  the  function  is  closed  if  for  any  a  G  M  the  set 
{x  G  dom /  :  /(x)  <  a}  is  a  closed  set. 

The  function  is  convex  if  dom /  is  a  convex  set  and  the  following  inequality  holds 
for  all  t  G  [0, 1]  and  all  x,i/G  dom/: 


f(tx  +  (1  -  t)y)  <  tf  (x)  +  (1  -  t)f(y). 


(2.27) 
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The  function  is  called  strictly  convex  if  the  above  inequality  holds  strictly  (with  “<”) 
for  all  t  G  (0, 1)  and  all  x  ^  y  such  that  f(x)  <  +oo  and  f(y)  <  +oo.  The  function 
is  called  strongly  convex  with  constant  v  >  0  if  the  function  f(x)  —  |||a;||2  is  convex, 
or  equivalently,  if  the  following  inequality  holds  for  all  t  G  [0, 1]  and  all  x,y  G  dom /: 

f(tx  +  (1  -  t)y)  <  tf(x )  +  (1  -  t)f(y)  -  1  -  t)\\x  -  y\\'2.  (2.28) 

For  a  convex  function  /,  we  let  the  subdifferential  (i.e.,  the  set  of  all  subgradients) 
of  /  at  x  G  dom /  be  denoted  by  df(x): 

df(x)  :=  {s  G  Rn  :  sT (y  -  x)  <  f{y)  -  f(x),  Wy  G  dom/}  .  (2.29) 

If  /  is  strongly  convex  with  constant  u  >  0,  then  it  can  be  shown  that  for  any 
x  G  dom /  and  s  G  d f(x),  the  following  holds: 

f(y)  ~  f(x)  >  sT(y  -x)  +  ^\\x-  y\\2.  (2.30) 

For  a  differentiable  function  /,  the  gradient  V/  is  called  Lipschitz  continuous  with 
constant  L  >  0  if 

II V/(x)  -  V/(y)||  <  L\\x  -  y\\,  Mx,y  G  dom/.  (2.31) 

Let  us  review  some  basic  properties  of  convex  functions  which  will  be  used  fre¬ 
quently  in  the  later  chapters. 

Lemma  2.1  (monotonicity  of  subdifferential).  If  function  f  is  convex,  then  for  any 
x,  y  G  dom  /,  we  have 

(x-y,s-t)  >  0,  Vs  G  df(x),  t  G  df(y).  (2.32) 

If  function  f  is  strongly  convex  with  constant  v  >  0,  then  for  any  x,y  G  dom  /,  we 
have 


{x-y,s-t)  >  v\\x  -  y ||2,  Vs  G  df(x),  t  G  d f(y). 


(2.33) 
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Proof.  For  brevity,  we  only  prove  (2.33)  since  it  reduces  to  (2.32)  when  o  =  0.  By 
(2.30),  for  any  s  G  df(x)  and  t  e  df(y),  we  have 

f(y)  ~  f(x)  -  sT(y  -x)>^\\x-  y\\ 2, 
f{x)  -  f(y )  -  tT(x  -y)>^\\x-  y\\2. 

Adding  these  two  inequalities  together  yields  (2.33).  □ 

Lemma  2.2.  If  a  convex  function  f  has  Lipschitz  continuous  gradient  V/  with  con¬ 
stant  L  >  0,  then  the  following  inequality  holds: 

{x~y,  V/(x)  —  V/(y))  >  -jr||V/(x)  —  V  f(y)\\2,  Vx,yedomf.  (2.34) 

For  the  sake  of  brevity,  we  omit  the  proof  which  can  be  found  in  many  convex 


analysis  books,  for  example  [44], 
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Chapter  3 

Generalized  ADMM  with  Simplified  Subproblems 

In  this  chapter,  we  consider  a  generalized  ADMM  framework  that  allows  various 
options  of  simplifying  the  subproblems.  The  generalization  of  ADMM  is  of  great 
practical  importance  because  it  brings  more  flexibility  and  efficiency  in  carrying  out 
the  iterations,  thereby  making  the  entire  algorithm  run  faster  and  scale  better  for 
large  problems. 

First,  in  Section  3.1,  we  discuss  the  motivation  of  the  generalization  and  give  some 
motivating  examples.  In  Section  3.2,  we  present  the  technical  details  of  the  generalized 
ADMM  framework,  as  well  as  comparison  to  some  closely  related  work  [30,62],  Then, 
in  Section  3.3,  we  establish  the  global  convergence  of  the  generalized  ADMM  under 
some  standard  assumptions.  In  Section  3.4,  we  briefly  review  some  existing  results  on 
the  convergence  rates  of  ADMM.  Furthermore,  in  Section  3.5,  we  introduce  a  simple 
technique  to  improve  an  existing  0(l/k)  convergence  rate  to  o(l/k),  where  k  is  the 
number  of  iterations. 

3.1  Motivation 

The  efficiency  of  ADMM  often  relies  on  the  fact  that  the  original  problem  can  be  de¬ 
composed  into  easier  subproblems  which  admit  efficient  solvers  or  simple  closed-form 
solutions.  In  many  applications,  however,  the  subproblems  of  ADMM  are  not  always 
easy  to  carry  out.  Especially  for  many  large-scale  problems,  solving  the  subproblems 
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exactly  may  become  computationally  expensive,  and  applying  ADMM  straightfor¬ 
wardly  may  be  inefficient.  If  one  spends  too  much  time  on  solving  each  single  sub¬ 
problem,  the  entire  algorithm  will  take  a  long  time  to  finish.  Often,  even  though 
some  subproblems  are  expensive  to  solve  exactly,  it  is  much  cheaper  to  compute  ap¬ 
proximate  solutions  to  the  subproblems  which  are  still  good  enough  to  guarantee  the 
convergence.  Although  it  may  take  more  iterations  to  converge  due  to  less  accurate 
subproblems,  the  entire  algorithm  can  finish  in  less  time  since  each  iteration  runs 
much  faster. 

In  practice,  many  variants  of  ADMM  have  been  developed  in  which  the  subprob¬ 
lems  are  solved  inexactly  but  faster  in  different  ways.  Below  we  give  several  useful 
examples  on  how  to  come  up  with  efficient  methods  to  solve  the  subproblems  approx¬ 
imately.  For  the  sake  of  brevity,  we  consider  only  the  .x-subproblem: 


min  f(x)  —  (A k)T Ax  +  ^||Ax  +  Byk+1  —  b\\2, 

x  2 


(3.1) 


while  the  same  arguments  will  apply  to  the  y-subproblem  as  well. 

•  Prox-linear  ADMM.  The  Prox- linear  method  was  introduced  in  [10],  which 


linearizes  the  quadratic  term  f  || Ax  +  Byk+1  —  b ||2  in  (3.1)  at  x  =  xk  and  adds 
a  proximal  term  ||a:  —  xk\\2.  That  is,  it  solves  the  following  problem: 


(3.2) 


where  r  >  0  is  a  proximal  parameter  and  pk  :=  /3AT(Axk  +  Byk+1  —  b  —  A k//3) 
is  the  gradient  of  the  last  two  terms  of  (3.1)  at  x  =  xk.  Compared  to  the 
original  x-subproblem  (3.1),  it  essentially  replaces  the  Hessian  matrix  (3ATA  of 
the  quadratic  term  by  an  identity  matrix  (A. 


Prox-linear  subproblems  are  much  easier  to  compute  in  various  applications. 
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For  example,  if  /  is  a  separable*  function,  problem  (3.2)  reduces  to  a  set  of  in¬ 
dependent  one-dimensional  problems.  In  particular,  if  /  is  i\  norm,  the  solution 
is  given  in  the  closed  form  by  the  so-called  soft-thresholding.  If  /  is  the  matrix 
nuclear  norm,  then  the  solution  is  given  by  singular-value  soft-thresholding.  If 
f(x)  =  ||$z||i  where  $  is  an  orthogonal  operator  or  a  tight  frame,  (3.2)  also 
has  a  closed-form  solution.  If  /  is  total  variation,  (3.2)  can  be  solved  by  graph- 
cut  [5,26].  There  are  a  large  number  of  such  examples  in  signal  processing, 
imaging,  statistics,  machine  learning,  etc. 

•  Gradient-descent  ADMM.  In  many  situations,  the  function  /  is  quadratic 
and  hence  the  x-subproblem  amounts  to  solving  a  linear  system.  When  one 
must  solve  a  large,  nontrivial  linear  system  at  every  iteration,  the  overall  com¬ 
putational  cost  can  be  very  expensive.  Alternatively,  one  may  simply  take  a 
gradient  descent  step  to  compute  an  approximate  solution: 


(3.3) 


where  a  >  0  is  a  step  size  and 


gk  ■=  Vf(xk)  +  f3AT(Axk  +  Byk+1  -  b  -  \k/(3) 


(3.4) 


is  the  gradient  of  the  augmented  Lagrangian  £a(x,  yk+1,  \k)  at  x  —  xk.  Appar¬ 


ently,  taking  one  gradient  step  only  requires  several  matrix-vector  multiplica¬ 
tions,  which  significantly  reduces  the  cost  and  has  a  clear  speed  advantage. 

•  ADMM  with  fast  approximation  of  A  (and/or  B).  The  x-subproblem 


contains  the  quadratic  term  | xTATAx.  Sometimes,  replacing  AT A  by  a  certain 


*We  call  a  function  /  :  Rra  — >  R  separable  if  f(x)  =  +  •  •  •  +  fn(%n)  for  x  = 


{x\,X2,  •  •  • ,  xn)  €  ffi”,  where  /*  :  R  — ►  R,  Vi  =  1,  2, . . .  ,n 
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symmetric  matrix  D  ~  AT A  makes  the  problem  easier  to  compute.  For  ex¬ 
ample,  this  approach  is  useful  when  AT A  is  nearly  diagonal  (D  is  the  diagonal 
matrix),  or  is  an  orthogonal  matrix  plus  error  (D  is  the  orthogonal  matrix), 
as  well  as  when  an  off-the-grid  operator  A  can  be  approximated  by  its  on-the- 
grid  counterpart  that  has  very  fast  implementations  (e.g.,  the  discrete  Fourier 
transforms  and  FFT). 


3.2  Generalized  ADMM 


As  discussed  above,  variants  of  ADMM  that  allow  subproblems  to  be  solved  approx¬ 
imately  and  faster  are  very  important  to  the  applications  in  which  it  is  expensive  to 
exactly  solve  either  the  x-subproblem  or  the  y-subproblem,  or  both  of  them.  For  this 
reason,  we  present  a  generalized  ADMM  framework  (Algorithm  2  below)  by  allow¬ 
ing  various  ways  of  simplifying  the  subproblems.  In  particular,  it  includes  not  only 
the  classic  ADMM  (Algorithm  1),  but  also  the  aforementioned  Prox-linear  ADMM, 
Gradient- descent  ADMM,  ADMM  with  fast  approximation  of  A  and/or  B  as  well  as 
combinations  of  them  as  special  cases. 

Algorithm  2:  Generalized  ADMM 


1  Choose  Q  >z  0  and  a  symmetric  matrix  P  (possibly  indefinite); 

2  Initialize  x°,  y°,  A0,  (3  >  0,7  >  0; 

3  for  k  —  0,  1, ...  do 

yk+i  =  arg  rniri)j  CA(xk,  y,  Xk)  +  \{y  -  yk)TQ(y  -  yk ); 
xk+l  =  argmin;[.  CA{x,  yk+1,  Xk)  +  \(x  —  xk)TP(x  —  xk ); 

Xk+ 1  =  Xk  -  7 P{Axk+1  +  Byk+1  -  b). 


Compared  to  Algorithm  1,  Algorithm  2  adds  ||| y  —  yk\\q  and  \\\x  —  xk\\2P  to  the 
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y-  and  x-subproblerns,  respectively,  and  assigns  7  as  the  step  size  for  the  update  of 
A.  Here,  we  slightly  abuse  the  notion  ||x|||^  :=  xTMx  as  we  allow  any  symmetric 
matrix  M.  Different  choices  of  P  and  Q  are  overviewed  in  the  next  subsection.  They 
can  make  steps  4  and  5  of  Algorithm  2  easier  than  those  of  Algorithm  1.  Obviously, 
Algoirthm  2  reduces  to  Algorithm  1  when  P  =  0,  Q  =  0  and  7  =  1. 

We  do  not  fix  7  =  1  like  in  most  of  the  ADMM  literature  since  7  plays  an 
important  role  in  convergence  and  speed.  For  example,  when  P  =  0  and  Q  —  0,  any 
7  G  (0,  (a/5  +  l)/2)  guarantees  the  convergence  of  Algorithm  2  [22],  but  7  =  1.618  ~ 
(a/5  +  1)/2  makes  it  converge  noticeably  faster  than  7  =  1.  As  we  will  show  in  Section 
3.3,  the  range  of  7  depends  on  P  and  Q,  as  well  as  f3.  When  P  is  indefinite,  7  must 
be  smaller  than  1  or  the  iteration  may  diverge. 

3.2.1  Choices  of  P  and  Q. 

The  general  goal  is  to  wisely  choose  P  and  Q  so  that  the  subproblems  of  Algorithm 
2  become  much  easier  to  carry  out  and  the  entire  algorithm  runs  in  less  time.  Let  us 
give  a  few  examples  of  matrix  P  in  step  5  of  Algorithm  2.  These  examples  also  apply 
to  Q  in  step  4  as  well.  Note  that  P  and  Q  can  be  different. 

•  Exact  ADMM.  If  the  original  .x'-subproblem  in  Algorithm  1  is  already  easy  to 
compute  exactly,  then  one  can  simply  set  P  =  0. 

•  Prox-linear  ADMM.  Setting 

P=-I-(3AtA ,  (3.5) 

T 

gives  rise  to  the  prox-linear  problem  (3.2). 
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•  Gradient-descent  ADMM.  When  function  /  is  quadratic ,  letting 

P=^I-Hf-  [3AtA ,  where  Hf  :=  V2f(x )  b  0,  (3.6) 

yields  the  following  x-subproblern: 

min  (gk)T(x  —  xk)  + -^—\\x  —  xk\\l,  (3.7) 

x  2a 

where  gk  is  the  gradient  given  by  (3.4).  Clearly,  it  amounts  to  the  gradient 
descent  step  (3.3). 


•  ADMM  with  fast  approximation  of  A  (and/or  B ).  To  replace  ATA  by  a 
certain  symmetric  matrix  D  «  AT A.  one  can  let 


P  =  P(D  -  ATA). 


(3.8) 


This  choice  of  P  effectively  turns  ^xT AT Ax  into  Dx 


since 


li\\Ax  +  Byk+'-b\\l+1-\\x-xkfp 
=  —xtDx  +  [terms  linear  in  x]  +  [terms  independent  of  x] . 


Note  that  P  can  be  indefinite.  This  approach  also  applies  to  step  4  of  Algorithm 
2  as  long  as  Q  >-  0. 


In  ADMM,  the  two  subproblems  can  be  solved  in  either  order  (but  fixed  through¬ 
out  the  iterations).  However,  when  one  subproblem  is  solved  less  exactly  than  the 
other,  Algorithm  2  tends  to  run  faster  if  the  less  exact  one  is  solved  later  —  as¬ 
signed  as  step  5  of  Algorithm  2  —  because  at  each  iteration,  the  ADMM  updates  the 
variables  in  the  Gauss-Seidel  fashion.  If  the  less  exact  one  runs  first,  its  relatively 
inaccurate  solution  will  then  affect  the  more  exact  step,  making  its  solution  also  inac¬ 
curate.  Since  the  less  exact  subproblem  should  be  assigned  as  the  later  step  5,  more 
choices  of  P  are  needed  than  Q,  which  is  the  case  in  Algorithm  2. 


22 


3.2.2  Related  Work 

Let  us  overview  two  works  very  related  to  Algorithm  2.  Both  works  [30, 62]  consider 
adding  proximal  terms  to  the  subproblems,  where  P  and  Q  are  restricted  to  positive 
semi-definite  matrices.  In  [30],  the  convergence  analysis  is  restricted  to  the  case  of 
7  =  1  and  differentiable  functions  /  and  g ;  on  the  other  hand,  the  quadratic  penalty 
term  of  augmented  Lagrangian  is  further  generalized  to  \\Ax-\- By —b\\2Hk  for  a  sequence 
of  bounded  positive  definite  matrices  {Hk}.  The  work  [62]  extends  the  scalar  7  to 
a  positive  definite  matrix  C  and  establishes  convergence  assuming  that  A  =  I  and 
the  smallest  eigenvalue  of  C  is  no  greater  than  1,  which  corresponds  to  7  <  1  when 
C  =  7/. 

Our  work  makes  meaningful  extensions  to  the  existing  convergence  theory  in  [30, 
62],  Specifically,  the  step  size  7  is  less  restrictive,  and  P  is  allowed  to  be  indefinite. 
These  extensions  translate  to  faster  convergence  and  more  options  of  solving  the  x- 
subproblem  efficiently.  While  there  is  no  rate  of  convergence  given  in  [30,62],  we  will 
further  establish  linear  convergence  rates  in  Chapter  4. 

3.3  Global  Convergence 

In  this  section,  we  show  the  global  convergence  of  Algorithm  2.  The  proof  steps  are 
similar  to  the  existing  convergence  theory  in  [30, 62]  but  are  adapted  to  Algorithm  2 
with  several  extensions.  The  analysis  in  this  section  will  also  be  used  frequently  in 
the  next  chapter  to  establish  linear  convergence  under  additional  assumptions. 
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3.3.1  Assumptions 

Throughout  the  rest  of  this  chapter  as  well  as  Chapter  4,  we  make  the  following 
standard  assumptions. 

Assumption  3.1.  There  exists  a  saddle  point  u*  :  =  ( x*,y*,X *)  to  problem  (2.6), 
namely,  x* ,  y* ,  and  X*  satisfy  the  KKT  conditions: 

AT X*  G  df(x*),  (3.9) 

BT X*  G  dg(y*),  (3.10) 

Ax*  +  By*~b=  0.  (3.11) 

These  conditions  can  be  written  in  a  more  compact  form  by  the  following  varia¬ 
tional  inequality: 

f(x)  +  g(y)  -  f(x*)  -  g(y*)  +  (u  -  u*)T H(u*)  >  0,  Mu,  (3.12) 


where 


( 


H(u)  := 


\ 


-ATA 
-BT  X 

yAx  +  By  —  bj 

When  assumption  3.1  fails  to  hold,  the  ADMM  method  has  either  unsolvable  or 
unbounded  subproblems  or  a  diverging  sequence  of  Xk. 


Assumption  3.2.  Functions  f  and  g  are  closed,  proper  and  convex. 


We  define  scalars  Vf  and  vg  as  the  modulus  of  /  and  g,  respectively.  Following 
from  Lemma  2.1,  they  satisfy 

(si  -  s2,  x i  -  x2)  >  Vf\\xi  -  x2\\2,  Mxi,  x2,  si  G  df(xi),  s2  G  df(x2),  (3.13) 
<^1  —  ^2, 2/1  —  2/2)  >  II2/1  —  2/2II2,  Myi,  y2,  t\  G  dg(y±),  t2  G  dg(y2).  (3.14) 
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From  the  convexity  of  /  and  g,  it  follows  that  Vf  >  0  and  vg  >  0.  They  are  (strictly) 
positive  if  the  functions  are  strongly  convex. 

Throughout  this  chapter,  since  /  and  g  are  allowed  to  be  any  convex  functions, 
we  can  treat  Uf  =  0  and  ug  —  0  for  the  worst  case.  In  Chapter  4,  we  will  further 
assume  ff  >  0  or/and  ug  >  0,  as  strong  convexity  of  at  least  one  of  the  functions  is 
needed  to  show  linear  convergence. 

3.3.2  Convergence  Analysis 

For  notation  simplicity,  we  introduce 

A:=A k  -  p(Axk+1  +  Byk+1 -b).  (3.15) 

If  7  =  1,  then  A  =  Afc+1;  otherwise, 

A  -  Afc+1  =  (7  -  1  )P(Axk+1  +  Byk+1  -  b)  =  (1  -  ^)(Afc  -  Afc+1).  (3.16) 

A 

This  relation  between  A  and  Afc+1  is  used  frequently  in  our  analysis.  Let 


' x 

* 

u  :  = 

y* 

,  uk  :  = 

yk 

,  u  :  = 

,  for  k  —  0, 1, . . . , 

(3.17) 

w 

lAV 

X  J 

where  u*  is  a  KKT  point,  uk  is  the  current  point,  and  u  is  the  next  point  as  if  7  =  1, 
and 


( In 

\ 

(  P 

\ 

(  P 

\ 

G0:  = 

I'm 

,  G1  :  — 

Q 

/m  .  /m — 1  /m 

,  L J  . —  Wq  Lt  ]  — 

Q 

V 

ih) 

\ 

V pj 

V 

(3.18) 


where  we  recall  P  =  P  +  f3ATA.  From  these  definitions  it  follows 

uk+1  =  uk -G0(uk -u).  (3.19) 
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We  choose  P,  Q  and  /3  such  that  P  P  0  and  Q  P  0.  Hence  G  P  0  and  ||  •  ||g  is  a 
(semi-)norm.  The  definitions  of  the  matrix  G  and  the  G-norrn  are  similar  to  those  in 
the  work  [35].  Our  analysis  is  based  on  bounding  the  error  || uk  —  m*||g  and  estimate 
its  decrease. 


Lemma  3.1.  Under  Assumptions  3.1  and  3.2,  the  sequence  {wfc}  of  algorithm  2  obeys 

i) 

AT A  +  P(xk  -  xk+1)  e  df(xk+1 ),  (3.20) 

Bt  (A  -  [3A{xk  -  xk+1))  +  Q{yk  -  yk+1)  e  dg(yk+1).  (3.21) 

a) 

(xk+1  -  x*,  AT(  A  -  A*)  +  P(xk  -  xk+1))  >  Vf\\xk+1  -  Pf,  (3.22) 

(: Vk+l  -  y\  BT  (a  -  A*  -  (3A{xk  -  xfc+1))  +  Q(yk  -  yk+1 ))  >  ^||2/fc+1  -  z/*|[2. 

(3.23) 

Hi) 

A(xm  -  x')  +  B(yk+l  -  y')  =  i(A‘  -  A).  (3,24) 

iv) 

\\uk-u*\\2G-\\uk+1-u*\\2G  >  h(uk —u) +2v f\\xk+1  -> x* \\2 +2vg\\yk+1  —y* \\2 ,  (3.25) 


where 


Proof. 


h(uk  -  u)  :=\\xk  -  xk+%  +  ||/  -  yk*%  +  ^||A*  -  A||2 

(3.26) 

+  2(Afc- A )TA(xk -xk+1). 

i)  By  the  optimality  conditions  for  the  subproblems  of  Algorithm  2  and 


using  (3.15),  we  obtain  (3.20)  and  (3.21)  immediately. 
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ii)  By  the  convexity  of  /  (3.13),  combining  the  optimality  conditions  (3.9)  and  (3.20) 
yields  (3.22).  Similarly,  by  the  convexity  of  g  (3.14),  combing  the  optimality 
conditions  (3.10)  and  (3.21)  yields  (3.23). 


iii)  The  equation  (3.24)  follows  directly  from  (3.11)  and  (3.15). 


iv)  By  adding  (3.22)  and  (3.23)  and  using  (3.24),  we  have 

h\k  -A,  A  -  A*  -  (3A(xk  -  xk+1))  +  ( xk+1  -  x *,  P{xk  -  xk+1)) 

r' 

+  (yk+1  -  y\  Q{yk  -  yk+1))  >  vf\\xk+1  -  x*||2  +  vg\\yk+1  -  y*  II2,  (3.27) 


which  can  be  simplified  as 

(u  —  u*)TG\(uk  —  u) 

>(A{xk  -  xk+1 ),  Xk  -  A)  +  uf\\xk+1  -  x*||2  +  ug\\yk+1  -  y* ||2. 
By  rearranging  the  terms,  we  have 

(■ uk  -  M*)TGi(Mfc  -u)>  || uk  -  ufGl  +  {A(xk  -  xk+l ),  \k  -  A) 

+vf\\xk+1  -  x*\\2  +  vg\\yk+1  -  y*\\2. 
Then,  from  (3.19)  and  the  identity 


(3.28) 


(3.29) 


lla  —  c||  q  —  ||  b  —  c\\q  =  2  (a  —  c)TG(a  —  b)  —  ||  a  —  b\\G,  (3.30) 


it  follows  that 


|| uk  —  u*\\q  —  ||rifc+1  —  u*\\q  =  2 {uk  —  u*)T G\(uk  —  u)  —  \\G0(uk  —  u)\\q.  (3.31) 
Substituting  (3.29)  into  the  right-hand  side  of  the  above  equation,  we  have 


*  1 1 2 


1 


*  1 1 2 


\ZL  U  \\g  || li  Vb  ||^ 


>  2||m  —  u\\Gi  —  ||m  —  w||GlGo 
+  2  (A(xk  -  xk+1 ),  Xk  -  A)  +  2uf\\xk+1  -  x*||2  +  2ug\\yk+1  -  y*  ||2, 


(3.32) 
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and  thus  (3.25)  follows  immediately. 

□ 

In  the  next  theorem,  we  show  that  || uk  —  u* ||g  has  sufficient  descent.  Technically, 
it  is  done  through  bounding  h(uk  —  u)  from  zero  by  applying  the  Cauchy-Schwarz 
inequality  to  its  cross  term  2(Afe  —  \)T  A(xk  —  xk+1).  If  P  =  0,  a  more  refined  bound 
is  obtained  to  give  7  a  wider  range  of  convergence. 

Theorem  3.1  (Sufficient  descent  of  || uk  —  w*||g).  Assume  Assumptions  3.1  and  3.2. 
i)  When  P/0,  if  7  obeys 

(2  -  7 )P  y  (7  -  1)(3AtA  (3.33) 


(see  Remark  3.1  below  for  simplification) ,  then  there  exists  7  >  0  such  that 


k  *||2  ||  fe+1  *||2 


U  —  U  \\q  —  \\U  —  U 


>T]\\uk  -  uk+l \\2g  +  2vf\\xk+l  -  x*\\z  +  2vg\\yk+L  -  y 


IG 


(3.34) 


k+1  _  ™*ll2  O,,  |L,fc+1  _  II 2 


ii)  When  P  =  0,  if 


7  G  0, 


1  +  V5N 


(3.35) 


then  there  exist  7  >  0  such  that 

-  u*\\q  +  ^||rfc||2)  -  -  u*\\q  +  ^||rfc+1||2 

>  T]\\uk  -  uk+lfG  +  2uf\\xk  -  xk+1\\2  +  2uf\\xk+1  -  ^||2  +  2  ug\\yk+l  -  y*  ||2, 

(3.36) 

where  rk  is  the  residual  at  iteration  k: 


rk  :=  Axk  +  Byk  -  b. 
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If  we  set  7  =  1,  then  we  have 


\u  —  u 


*  1 1 2 
I G 


\U 


k+1  *  ||  2 


u*\\2g  >|| uk  -  Uk+1\\2G  +  2uf\\xk  -  xfc+1||2 


(3.37) 


+  2uj\\xk+1  -  x*\\2  +  2vg\\yk+l  -  y* ||2. 

Proof.  i)  By  the  Cauchy-Schwarz  inequality,  we  have 

2(Afc  -  A )TA{xk  -  xk+1)  >  ~\\A(xk  -  xk+1)\\2  -  p\\Xk  -  A||2,  Vp  >  0.  (3.38) 

Substituting  (3.38)  into  (3.25)  and  using  f(Xk  —  Xk+1 )  =  Xk  —  A,  we  have 


*  1 1 2 


k+1  *  ||2 


| Vb  Vb  ||^»  Vu  ||^f 


>  \\xk  -  xk+%_lATA  +  || yk  -  yk+1\\ l  + 

P 


2-7 

P 


p)  i||A‘-A‘+1||2  (3,39) 


+  2of\\xk+1  -  x*\\2  +  2vg\\yk+1  -  p*||2,  Vp  >  0. 

To  show  that  (3.34)  holds  for  a  certain  p  >  0,  we  only  need  P  —  1 ATA  y  0  and 
—  p  >  0  for  a  certain  p  >  0,  which  is  true  if  and  only  if  we  have  P  >-  APATA 
or,  equivalently,  (3.33). 


ii)  For  P  =  0,  we  first  derive  a  lower  bound  for  the  cross  term  (Xk  —  X)T A(xk  —  xk+1) . 
Applying  (3.20)  at  two  consecutive  iterations  with  P  =  0  and  in  light  of  the 
definition  of  A,  we  have 


|  AT[Xk~1  —  /3(Axk  +  Byk  —  b)]  e  df(xk), 
)  AT X  e  df(xk+1). 

The  difference  of  the  two  terms  on  the  left  in  (3.40)  is 


(3.40) 


AT[Xk~1-+Axk+Byk-b)- A]  =  AT(Xk-X)-(l-+pAT(Axk+Byk-b).  (3.41) 


By  (3.40),  (3.41)  and  (3.13),  we  get 

(. AT(Xk-X),xk-xk+1)-((l-+/3AT(Axk+Byk-b),xk-xk+1 )  >  uf \\xk-xk+1\\2 


(3.42) 
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to  which  applying  the  Cauchy-Schwarz  inequality  gives 
(Afc- A )TA(xk  -xk+1) 

>{^(Axk  +  Byk  -  b ),  (1  -  ^)^//3A(xk  -  xk+1))  +  uf\\xk  -  xk+1\\2 
>  -  ^-\\Axk  +  -  6||2  -  A(xk  -  xk+1)\\2  +  vf\\xk  -  xk+1\\2, 

(3.43) 

for  any  p  >  0.  Substituting  (3.43)  into  (3.25)  and  using  P  =  P  +  /3ATA  =  (3ATA 
and  the  definition  of  A,  we  have 

\\uk-u*\\2G+  -^\\Axk  +  Byk  -bf 

>  || uk+1  -  u*\\2g  +  ^||Acfc+1  +  Byk+1  -  b\\2 

+  0  ^2  -  7  -  \\Axk+l  +  Byk+l  -  b\\2  +  0  (l  -  (1  -  7)2p)  ||A(xfc  -  a:fc+1)||2 

+  || yk  -  yk+%  +  2i/f\\xk  -  zfe+1||2  +  2^/||xfe+1  -  x*||2  +  2vg\\yk+1  -  y* ||2. 

(3.44) 

To  prove  such  rj  >  0  exists  for  (3.36),  we  only  need  the  existence  of  p  >  0  such 
that  2  —  7—  -p  >  0  and  1  —  (1  —  7)2p  >  0,  which  holds  if  and  only  if  2  — 7  >  (1  —  7)2 
or,  equivalently,  7  6  (0,  1+2V'^). 

In  this  case  of  P  =  0,  if  we  set  7  =  1,  (3.43)  reduces  to  (Afc  —  \)T A(xk  —  xk+1)  > 
Vf  \\xk  —  xk+1\\2,  which  substituting  into  (3.25)  gives  (3.37)  with  rj  =  1. 

□ 

Now  the  sufficient  descent  of  \\uk  —  w*||q  in  Theorem  3.1  are  used  to  yield  the 
global  convergence  of  Algorithm  2. 


Theorem  3.2  (Global  convergence  of  Algorithm  2).  Assume  Assumptions  3.1  and 
3.2,  and  that  {uk}  of  Algorithm  2  is  bounded  (see  Remark  3.2  below).  For  any  7 
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satisfying  its  conditions  given  in  Theorem  3.1,  {wfc}  converges  to  a  KKT  point  u*  of 
(2.6)  in  the  G-norm,  namely, 

\\uk  -u*\\g  -)•  0. 

It  further  follows  that 

(a)  Xk  —$■  A*,  regardless  of  the  choice  of  P  and  Q; 

(b)  when  P^O,  xk  — >  x* ;  otherwise,  Axk  — >•  Ax*; 

(c)  when  Q  >-  0,  yk  — >  y* ;  when  Q  =  0,  Byk  — »  By* . 

Proof.  Being  bounded,  {uk}  has  a  converging  subsequence  {ukj}.  Let  u  =  lirn^oo  ukj . 
Next,  we  will  show  u  is  a  KKT  point.  Let  u*  denote  an  arbitrary  KKT  point. 

Consider  P  ^  0  first.  From  (3.34)  we  conclude  that  \\uk  —  u*\\q  is  monotonically 
nonincreasing  and  thus  converging,  and  due  to  rj  >  0,  \\uk  —  uk+1\\Q  — >  0.  In  light  of 
(3.18)  where  P  P  0  and  Q  P  0,  we  obtain  \k  —  Afc+1  — >  0  or  equivalently, 

rk  =  ( Axk+1  +  Byk+1  —  b)  — *  0,  as  k  — *  oo.  (3.45) 

Now  consider  P  =  0.  From  (3.36)  we  conclude  that  \\uk  —  u* \\2G  +  y||rfc||2  is 
monotonically  nonincreasing  and  thus  converging.  Due  to  r/  >  0,  \\uk  —  uk+1\\Q  — )■  0, 
so  Xk  —  Afc+1  — >  0  and  (3.45)  holds  as  well.  Consequently,  \\uk  —  u*\\2G  also  converges. 

Therefore,  by  passing  limit  on  (3.45)  over  the  subsequence,  we  have  for  P  =  0  or 
not: 

Ax  +  By  -  b  =  0.  (3.46) 

Recall  the  optimality  conditions  (3.21)  and  (3.20): 

Bt X  -  pBTA(xk  -  xk+1 )  +  Q(yk  -  yk+1)  e  dg(yk+1 ), 

AT X  +  P{xk  -  xk+1 )  e  df(xk+1). 
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Since  \\uk  —  uk+1\\Q  — y  0,  in  light  of  the  definition  of  G  (3.18),  we  have  the  following: 

•  when  P  =  0,  Afxk  —  xk+1 )  — >  0; 

•  when  P  7^  0,  the  condition  (3.33)  guarantees  P  >-  0  and  thus  xk  —  xk+1  — >  0; 

•  since  Q  P  0,  we  obtain  Q(yk  —  yk+1 )  — >•  0. 

In  summary,  j3BT A(xk  —  xk+1 ),  Q(yk  —  yk+1),  and  P(xk  —  aA’+1)  are  either  0  or 
converging  to  0  in  k,  no  matter  P  =  0  or  not. 

Now  on  both  sides  of  (3.21)  and  (3.20)  taking  limit  over  the  subsequence  and 
applying  Theorem  24.4  of  [52],  we  obtain: 

BT A  G  dg(y),  (3.47) 

AT A  G  df(x).  (3.48) 

Therefore,  together  with  (3.46),  u  satisfies  the  KKT  condition  of  (2.6). 

Since  u  is  a  KKT  point,  we  can  now  let  u*  =  u.  From  ukj  — y  u  in  j  and  the 
convergence  of  \\uk  —  u* \\2G  it  follows  || uk  —  u* ||^  — >  0  in  k. 

By  the  definition  of  G,  || uk  —  u*\\q  — >  0  implies  the  following: 

(a)  Xk  — >•  A*,  regardless  of  the  choice  of  P  and  Q ; 

(b)  when  P^O,  condition  (3.33)  guarantees  P  >~  0  and  thus  xk  — >  x*\  when  P  =  0, 
Axk  — y  Ax*] 

(c)  when  Q  y  0,  yk  — >•  y*\  when  Q  —  0,  Byk  — >  By*  following  from  (3.45)  and  (3.46). 

□ 

Remark  3.1.  Let  us  discuss  the  conditions  on  7.  If  P  >-  0,  the  condition  (3.33)  is 
always  be  satisfied  for  0  <  7  <  1.  However,  in  this  case,  7  can  go  greater  than  1, 
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which  often  leads  to  faster  convergence  in  practice.  If  P  f-  0,  the  condition  (3.33) 
requires  7  to  lie  in  (0,7)  where  0  <  7  <  1  depends  on  (3,  P,  and  ATA.  A  larger  f3 
would  allow  a  larger  7. 

In  particular,  in  Prox-liner  ADM  where  the  x-subproblem  is  solved  by  (3.2),  con¬ 
dition  (3.33)  is  guaranteed  by 


tP||2  +  7<2. 


(3.49) 


In  Gradient-descent  ADM  where  the  x-subproblem  has  the  form  of  (3.7),  a  sufficient 
condition  for  (3.33)  is  given  by 


mr 


+7  <  2. 


(3.50) 


1 


a. 


Remark  3.2.  The  assumption  on  the  boundedness  of  the  sequence  {uk}  can  be  guar¬ 
anteed  by  various  conditions.  Since  (3.34)  and  (3.36)  imply  that  ||«fc— u*\\2G  is  bounded, 
{uk}  must  be  bounded  if  P  >-  0  and  Q  >-  0.  Furthermore,  if  P  =  0  and  Q  =  0,  we 
have  the  boundedness  of{(Axk,Xk)}  (since  \\uk  —  u*\\q  is  bounded)  and  that  of  {Byk} 
by  (3.24),  so  in  this  case,  {uk}  is  bounded  if 


(i)  matrix  A  has  full  column  rank  whenever  P  =  0;  and 


(ii)  matrix  B  has  full  column  rank  whenever  Q  =  0. 


In  addition,  the  boundedness  of  {uk}  is  guaranteed  if  the  objective  functions  are  co 


ercive. 


3.4  Existing  Rate-of-Convergence  Results 


In  this  section,  we  review  some  existing  results  on  the  convergence  rate  of  ADMM. 
Although  there  is  extensive  literature  on  ADMM  and  its  applications,  there  are  very 
few  results  on  its  rate  of  convergence  until  the  very  recent  past.  Work  [24]  shows 
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that  for  a  Jacobi  version  of  the  ADMM  applied  to  smooth  functions  with  Lipschitz 
continuous  gradients,  the  objective  value  descends  at  the  rate  of  0(l/k)  and  that 
of  an  accelerated  version  descends  at  0{l/k2).  Then,  work  [25]  establishes  the  same 
rates  on  a  Gauss-Seidel  version  and  requires  only  one  of  the  two  objective  functions  to 
be  smooth  with  Lipschitz  continuous  gradient.  But  these  two  works  only  consider  the 
model  with  the  linear  constraint  coefficient  matrices  A  and  B  being  identity  matrix 
or  negative  identity  matrix. 

Lately,  work  [36]  shows  that  the  optimality  conditions  of  the  ADMM  based  on 
a  variational  inequality  converges  at  an  0(l/k )  rate  in  an  ergodic  sense.  Work  [35] 
shows  that  \\uk  —  ■ufc+1||2,  where  uk  :  =  ( xk,yk ,  Xk),  of  the  ADMM  converges  at  0(l/k). 
Work  [28]  proves  that  the  dual  objective  value  of  an  accelerated  version  of  ADMM 
descends  at  0{l/k2)  under  the  assumption  that  the  objective  functions  are  strongly 
convex  (one  of  them  being  quadratic). 

Besides  these  sublinear  rates,  several  linear  rates  have  also  been  established.  Work 
[16]  shows  that  the  ADMM  applied  to  linear  programming  converges  at  a  global  linear 
rate.  For  quadratic  programming,  work  [3]  presents  an  analysis  leading  to  a  conjecture 
that  the  ADMM  should  converge  linearly  near  the  optimal  solution. 

Work  [38]  proves  the  linear  convergence  of  ADMM  in  a  different  approach.  The 
linear  convergence  in  [38]  requires  that  the  objective  function  takes  a  certain  form 
involving  a  strictly  convex  function  and  the  step  size  for  updating  the  multipliers 
is  sufficiently  small  (which  is  impractical),  while  no  explicit  linear  rate  is  given.  Its 
recent  update  assumes  a  bounded  sequence  in  addition.  On  the  other  hand,  it  allows 
more  than  two  blocks  of  separable  variables  and  it  does  not  require  strict  convexity; 
instead,  it  requires  the  objective  function  to  include  f(Ex),  where  /  is  strictly  convex 
and  E  is  a  possibly  rank-deficient  matrix. 
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The  linear  convergence  of  AD  MM  was  also  established  in  the  context  of  Douglas- 
Rachford  Splitting  Mehtod  (DRSM)  [15]  and  Proximal  Point  Mehtod  (PPA)  [50].  It 
has  been  shown  that  ADMM  is  a  special  case  of  DRSM  [20],  and  further  DRSM 
is  a  special  case  of  PPA  [17].  Therefore,  the  linear  convergence  of  ADMM  can  be 
obtained  from  the  existing  linear  convergence  results  of  DRSM  and  PPA  [40,50]. 
However,  it  is  unclear  whether  those  results  can  apply  to  the  generalizations  to  ADMM 
(Algorithm  2).  It  remains  an  open  question  whether  these  generalizations  to  ADMM 
are  equivalent  to  the  iterations  of  some  firmly  nonexpansive  operators,  which  will  be 
left  for  future  research. 

In  the  next  subsection,  we  introduce  a  simple  technique  to  slightly  improve  the 
result  in  [35]  from  0(1/ k)  to  o(l/k).  In  Chapter  4,  we  will  present  detailed  analysis  to 
show  the  linear  convergence  of  the  Generalized  ADMM  under  a  variety  of  scenarios  in 
which  at  least  one  of  the  two  objective  functions  is  strictly  convex  and  has  Lipschitz 
continuous  gradient.  This  rate  is  stronger  than  the  sublinear  rates  such  as  0(1/ k) 
and  0(1 /k2)  and  is  given  in  terms  of  the  solution  error,  which  is  stronger  than  those 
given  in  terms  of  the  objective  error  in  [24,25,28],  violation  of  optimality  conditions 
in  [36],  and  the  solution  relative  change  in  [35].  On  the  other  hand,  [24,25,35,36]  do 
not  require  any  strictly  convex  functions.  The  fact  that  a  wide  range  of  applications 
give  rise  to  model  (2.6)  with  at  least  one  strictly  convex  functions  has  motivated  our 
work.  In  Subsection  4.2.3,  we  will  review  the  linear  convergence  result  of  DRSM 
in  [40]  and  show  that  our  analysis  yields  a  better  linear  rate,  in  addition  to  its  wider 
applicability  for  generalizations  of  ADMM. 
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3.5  On  o(l/ k)  Convergence  Rate 

In  this  section,  we  improve  the  0(1 /k)  convergence  rate  of  ADMM  established  in  [35] 
slightly  to  o(l /k).  The  proof  technique  is  based  on  an  elementary  lemma  (see  Lemma 
3.2  below)  which  can  be  used  to  improve  many  other  existing  0(1/ k)  convergence 
rate  to  o(l/k)  as  well. 

Lemma  3.2.  If  a  sequence  { a C  M  obeys:  (1)  >  0,-  (2)  Y^k=iak  <  +00;  (3)  a k 

is  monotonically  non-increasing,  then  we  have  ak  =  o(l/k). 

Proof.  By  the  assumptions,  we  have 


k  ■  a2k  <  O-k+l  +  0‘k+ 2  +  '  '  '  +  «2fc  — t  0 


as  k  — »  Too.  Therefore,  ak  =  o(l/k). 


□ 


Intuitively,  the  harmonic  sequence  {1/A;}  is  not  summable,  so  a  summable,  non¬ 
negative,  monotonic  sequence  shall  converge  faster  than  {1/A;}. 

Let  us  briefly  review  the  analysis  in  [35].  For  simplicity,  we  consider  the  classic 
ADMM  (Algorithm  1)  but  the  analysis  can  be  extended  to  the  Generalized  ADMM 
(Algorithm  2)  as  well.  In  [35],  the  quantity  \\uk  —  uk+1\\Q  is  used  to  measure  the 
optimality  of  the  iterations,  where 

\ 


f  dAT  A 


u  := 


V 


G  :  = 


O 


W  V  i1/ 

Note  that  y  is  essentially  not  part  of  the  G'-norm.  I11  fact,  y  can  be  regarded  as  an 
intermediate  variable  in  the  iterations  of  ADMM,  whereas  x  and  A  are  the  essential 
variables  [4],  Based  on  the  optimality  conditions  (3.20)  and  (3.21)  as  well  as  the 
relation  between  \k—\k+1  and  Axk+1+Byk+1— b,  it  is  easy  to  see  that  if  \\uk— uk+1\\Q  = 
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0  then  uk+1  satisfies  the  KKT  conditions  and  thus  is  optimal.  On  the  other  hand,  if 
\\Uk  —  uk+1\\Q  is  large,  then  there  is  big  violation  of  the  KKT  conditions  so  uk+l  is  very 
likely  far  away  from  being  a  solution.  Therefore,  the  quantity  \\uk  —  uk+1\\G  can  be 
viewed  as  a  measure  of  the  distance  between  uk+l  and  the  solution  set.  Furthermore, 
it  seems  reasonable  to  use  the  squared  term  \\uk  —  uk+1  ||q  rather  than  || uk  —  uk+1\\c 
to  measure  the  convergence  rate  of  ADMM.  By  the  variational  inequality  (3.12)  as 
well  as  the  variational  characterization  of  the  ADMM  iterations  (refer  to  [35]  for 
more  details),  we  can  see  that  the  quadratic  term  \\uk  —  uk+1\\G  closely  relates  to  the 
objective  error  f(xk+1)  +  g(yk+1)  —  f{x*)  —  g{y*)-  Therefore,  in  the  similar  sense  that 
the  objective  error  is  commonly  used  to  measure  convergence  rates,  it  is  reasonable 
to  use  the  quadratic  term  \\uk  —  uk+1\\G  as  well. 

The  work  [35]  proves  that  \\uk  —  uk+1 1||.  converges  to  zero  at  a  rate  of  0(l/k). 
The  key  steps  of  the  proof  are  to  establish  the  following  properties. 

Lemma  3.3.  Let  {uk}  be  the  sequence  generated  by  Algorithm  1. 

(i)  The  sequence  {wfc}  is  contractive: 

||  uk  -  u*\\2g  -  ||  uk+1  -  u*fG  >  ||  uk  -  uk+1\\2G.  (3.51) 

(ii)  The  sequence  \\uk  —  uk+l\\2G  is  monotonically  non-increasing: 

|| uk  -  uk+1\\G  <  || u1*-1  -  uk\\2G.  (3.52) 

The  contraction  property  (3.51)  follows  directly  from  (3.37).  It  has  been  long 
established  in  the  literature,  which  dates  back  to  [21,23].  For  the  sake  of  completeness, 
we  shall  prove  the  monontonicity  property  (3.52).  Inspired  by  [35],  we  provide  a  much 
shorter  proof  than  the  one  in  [35]. 
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Proof  of  (3.52).  To  simplify  the  notation,  we  let 

Axfc+1  =  xk  -  xk+1,  Ayk+1  =  yk-  yk+\  A\k+1  =  \k  -  \k+l . 

Let  us  recall  the  part  (i)  of  Lemma  3.1.  We  have  the  following  optimality  conditions 
for  the  subproblems: 


AT \k+1  e  df(xk+1 ),  (3.53) 

Bt  (\k+1  -  pA(xk  -  xk+1))  e  dg{yk+l).  (3.54) 

By  the  monotonicity  of  sub  differential  df,  combining  (3.53)  at  k-  th  and  (k  +  l)-th 
iterations  yields 

(Axfc+1,  ATAAfc+1)  >  0.  (3.55) 

Similarly,  using  the  monotonicity  of  dg  for  (3.54)  at  k- th  and  (k  +  l)-th  iterations, 
we  obtain 

(Ayfc+1,  Bt  AXk+1  -  fiBTA( Axfc  -  Axfc+1))  >  0.  (3.56) 

Adding  the  above  two  inequalities  together,  we  have 

(AAxfc+1  +  Ayfc+1)TAAfc+1  -  (3(BAyk+1)T A(Ax.k  -  Axfc+1)  >  0.  (3.57) 

Note  that  we  have 

AAxfc+1  +  BAyk+1  =  ^(AAfc  -  AAfc+1).  (3.58) 

Then  we  substitute  Ayfc+1  using  (3.58),  and  thereby  (3.57)  becomes 


^(AAfc-AAfc+1)TAAfc+1-(AAfc-AAfc+1-/L4Axfc+1)TA(Axfc-Axfc+1)  >  0.  (3.59) 
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To  further  simplify  the  notation,  we  let 


afc  =  \//T4Axfc,  bk 


1 

7? 


AXk. 


After  rearranging  the  terms,  (3.59)  can  be  rewritten  as 

(afc  +  bfc)T(afe+1  +  bk+1)  -  (ak)Tbk  -  (afc+1)Tbfc+1 
>  ||afc+1||2  +  ||bA'+1||2  =  || uk  -  uk+l\\2G. 

By  the  Cauchy-Schwarz  inequality,  we  have 


(3.60) 


(afc  +  bfc)T(afc+1  +  bk+1)  <  (||afc  +  bfc||2  +  ||afc+1  +  bfc+1||2)/2,  (3.61) 


or  equivalently, 


(afc  +  bk)T(ak+1  +  bfc+1)  -  (ak)Tbk  -  (afc+1)Tbfc+1 
<  (||afc||2  +  ||bfc||2  +  ||afc+1||2  +  ||bfc+1||2)/2. 

Applying  the  above  inequality  to  the  left-hand  side  of  (3.60),  we  have 


(3.62) 


\u  —  u 


\G  — 


<  ( 1 1 afc 1 1 2  +  ||bfc||2  +  || afc 


+  ||bfc+1||2)/2 


=  u 


k- 1  „.k  ||  2 


mkHg+  \\uk  -uk+1fG)  / 2, 


and  thus  (3.52)  follows  immediately. 


(3.63) 


a 


We  are  now  ready  to  improve  the  convergence  rate  from  0(l/k)  to  o{l/k). 


Theorem  3.3  (Convergence  Rate  of  o{l/k)  ).  Let  {uk}  be  the  sequence  generated  by 
Algorithm  1.  Then  || uk  —  uk+1  \\2G  =  o{l/k).  Therefore, 

•  \\Axk  —  Axk+l  ||2  +  ||  Byk  —  Byk+1 1|2  +  ||Afc  —  Afc+1||2  =  o(l/k); 

•  ||Aa:fc+1  +  Byk+1  —  6||2  =  o(l/k). 
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Proof.  By  (3.51),  we  have 

n 

|| uk  —  uk+l\\2G  <  Ijw1  —  u*\\q  —  || un+l  —  u*\\q,  Vn.  (3.64) 

k= 1 

As  we  have  shown,  \\un+l  —  u*\\q  — >  0  as  n  — >■  oo.  Therefore,  YlkLi  \\uk  —  uk+1\\^  <  oo. 

By  (3.52),  \\uk  —  uk+1\\2G  is  monotonically  non- increasing.  Obviously,  ||wfc_ uk+1\\2G 
is  also  nonnegative.  Therefore,  \\uk  —  uk+l ||g.  =  o{l/k)  follows  from  Lemma  3.2 
immediately.  By  the  definition  of  G,  we  have  \\Axk  —  Axfc+1||2  =  o{l/k)  and  ||Afc  — 
Afc+1||2  =  o{l/k).  By  (3.58),  we  have  || Byk  —  Byk+l  ||2  =  o(l/k )  as  well.  Finally, 
|| Axk+1  +  Byk+1  -  fe||2  =  || Afc  -  \k+1\\2//32  =  o(l/k).  © 

Remark  3.3.  The  proof  technique  based  on  Lemma  3.2  can  be  applied  to  improve 
some  other  existing  convergence  rates  of  0(1/ k)  (e.g.,  [12,33])  to  o(l/k)  as  well. 
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Chapter  4 

Linear  Convergence  of  Generalized  ADMM 

The  main  goal  of  this  chapter  is  to  show  that  the  Generalized  ADMM  (Algorithm  2) 
applied  to  the  problem  (2.6)  has  global  linear  convergence  0(l/ck)  for  some  c  >  1, 
provided  that  one  of  the  two  objective  functions,  say,  /  is  strictly  convex,  V/  is 
Lipschitz  continuous,  and  matrices  A  and  B  have  certain  rank  conditions.  Note  that 
when  restricted  to  a  compact  set,  a  strictly  convex  function  is  strongly  convex.  So, 
as  long  as  an  algorithm  generates  a  bounded  sequence,  strict  convexity  is  effectively 
strong  convexity.  For  simplicity,  we  use  “strong  convexity”  or  “strongly  convex”  in 
the  remaining  of  the  chapter. 

The  chapter  is  organized  as  follows.  Section  4.1  summarizes  the  linear  convergence 
results  that  we  shall  establish  under  various  scenarios.  The  detailed  analysis  is  then 
given  in  Section  4.2.  Section  4.3  discusses  several  interesting  applications  that  are 
covered  by  our  linear  convergence  results.  Section  4.4  presents  numerical  results  to 
demonstrate  the  linear  convergence  behavior  of  ADMM  in  practice. 

4.1  Summary  of  Results 

We  shall  establish  the  global  linear  convergence  of  Algorithm  2  that  are  described  in 
Tables  4.1  and  4.2.  Table  4.1  summarizes  the  four  scenarios  under  which  we  study 
the  linear  convergence  of  Algorithm  2,  and  Table  4.2  specifies  the  linear  convergent 
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Table  4.1  :  Four  scenarios  leading  to  linear  convergence 


scenario 

strongly 

convex 

Lipschitz 

continuous 

full  row 

rank 

additional  assumptions 

1 

/ 

v/ 

A 

if  Q  P  0,  B  has  full  column  rank 

2 

f,9 

v/ 

A 

3 

f 

V/,  V<7 

- 

B  has  full  column  rank 

4 

fi  9 

V/,  V<7 

- 

quantities  for  different  types  of  matrices  P,  P,  and  Q ,  where 

P  :=  P  +  /3AT A 

is  defined  for  the  convenience  of  convergence  analysis.  P  =  0  and  Q  —  0  correspond 
to  exactly  solving  the  x-  and  y- subproblems,  respectively.  Although  P  =  0  and  P  >~  0 
are  different  cases  in  Table  4.2,  they  may  happen  at  the  same  time  if  A  has  full  column 
rank;  if  so,  apply  the  result  under  P  >-  0,  which  is  stronger. 

The  conclusions  in  Table  4.2  are  the  quantities  that  converge  either  Q-linearly 
or  R-linearly* *.  Q-linear  convergent  quantities  are  the  entireties  of  multiple  variables 
whereas  R-linear  convergent  quantities  are  the  individual  variables  xk,  yk,  and  Xk. 

Four  scenarios  of  global  linear  convergence.  In  scenario  1,  only  function 
/  needs  to  be  strongly  convex  and  having  Lipschitz  continuous  gradient;  there  is  no 

*Suppose  a  sequence  {ufc}  converges  to  it*.  We  say  the  convergence  is  (in  some  norm  ||  ■  ||) 

•  Q-linear ,  if  there  exists  y  €  (0, 1)  such  that  <  y 

•  R-linear ,  if  there  exists  a  sequence  {crk}  such  that  \\uk  —  it*||  <  ak  and  crk  — >  0  Q-linearly. 


42 


Table  4.2  :  Summary  of  linear  convergence  results 


case 

P,P 

Q 

any  scenario  1-4 

Q- linear  convergence 

R-linear  convergence 

1 

P  =  0 

=  0 

(Axk,  Xk ) 

2 

Py  0 

=  0 

(xk,  Xk ) 

xk,  ( yk  or  Byk)*,  Xk 

3 

P  =  0 

>-  0 

( Axk,yk ,  Xk ) 

4 

Py  0 

>-  0 

(xk,  yk,  Xk ) 

*  In  cases  1  and  2,  scenario  1,  R-linear  convergence  of  yk  requires  full 


column  rank  of  B\  otherwise,  only  Byk  has  R-linear  convergence. 


assumption  on  g  besides  convexity.  On  the  other  hand,  matrix  A  must  have  full 
row  rank.  Roughly  speaking,  the  full  row  rank  of  A  makes  sure  that  the  error  of 
\k  can  be  bounded  just  from  the  x-side  by  applying  the  Lipschitz  continuity  of  V/. 
One  cannot  remove  this  condition  or  relax  it  to  the  full  row  rank  of  [A,  B ]  without 
additional  assumptions.  Consider  the  example  of  A  =  [1;  0]  and  B  =  [0;  1],  where 
[A,R]  =  /  has  full  rank.  Since  A k,  which  is  the  2nd  entry  of  Xk,  is  not  affected  by  / 
or  {xfc}  at  all,  there  is  no  way  to  take  advantages  of  the  Lipschitz  continuity  of  V/  to 
bound  the  error  of  A^.  In  general,  without  the  full  row  rank  of  A,  a  part  of  Xk  needs 
to  be  controlled  from  the  y-side  using  properties  of  g. 

Scenario  2  adds  the  strong  convexity  assumption  on  g.  As  a  result,  the  remark  in 
case  1  regarding  the  full  column  rank  of  B  is  no  longer  needed. 

Both  scenarios  3  and  4  assume  that  g  is  differentiable  and  Vg  is  Lipschitz  con¬ 
tinuous.  As  a  result,  the  error  of  Xk  can  be  controlled  by  taking  advantages  of  the 
Lipschitz  continuity  of  both  V/  and  Vg,  and  the  full  row  rank  assumption  on  A  is 
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no  longer  needed.  On  the  other  hand,  scenarios  3  and  4  exclude  the  problems  with 
non-differentiable  g.  Compared  to  scenario  3,  scenario  4  adds  the  strong  convexity 
assumption  on  g  and  drops  the  remark  on  the  full  column  rank  of  B. 

Under  scenario  1  with  Q  y  0  and  scenario  3,  the  remarks  in  Table  4.1  are  needed 
essentially  because  yk  gets  coupled  with  xk  and  \k  in  certain  inequalities  in  our 
convergence  analysis.  The  full  column  rank  of  B  helps  bound  the  error  of  yk  by  those 
of  xk  and  Xk. 

Four  cases.  When  P  =  0  (corresponding  to  exactly  solving  the  x-subproblem), 
we  have  P  P  0  and  only  obtain  linear  convergence  in  Ax.  However,  when  P  >~  0, 
linear  convergence  in  x  is  obtained. 

When  Q  =  0  (corresponding  to  exactly  solving  the  y-subproblcm) ,  y  is  not  part 
of  the  Q-linear  convergent  joint  variable.  But,  when  Q  >~  0,  y  becomes  part  of  it. 

Remark  4.1.  Indeed,  we  only  use  f  and  g  ’s  properties  over  the  compact  sets  including 
{xk}  and  {yk},  not  globally;  thus,  strict  convexity  can  replace  strong  convexity. 

4.2  Linear  Convergence 

To  establish  the  linear  convergence,  we  take  three  steps.  First,  using  (3.34)  for  P  ^  0 
and  (3.36)  for  P  =  0,  as  well  as  the  assumptions  in  Table  4.1,  we  show  that  there 
exists  5  >  0  such  that 

II uk  -  u*\\l  >  (1  +  <5)||«fc+1  -  u*\\l,  (4.1) 

where  u*  =  lirn^oc  uk  is  given  by  Theorem  3.2.  We  call  (4.1)  the  Q-linear  convergence 
of  {uk}  in  G- (semi) norm.  Next,  using  (4.1)  and  the  definition  of  G,  we  obtain  the 
Q-linear  convergent  quantities  in  Table  4.2.  Finally,  the  R-lincar  convergence  in  Table 


2  is  established. 
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4.2.1  Analysis 

We  first  assume  7  =  1,  which  allows  us  to  simplify  the  proof  presentation.  At  the  end 
of  this  subsection,  we  explain  why  the  results  for  7  =  1  can  be  extended  to  7  7^  1  that 
satisfies  the  conditions  of  Theorem  3.1.  Note  that  for  7  =  1,  we  have  (3.37)  instead 
of  (3.36).  Hence,  no  matter  P  =  0  or  P  7^  0,  both  inequalities  (3.34)  and  (3.37)  have 
the  form 

\\uk-u*\\ l  -  \\uk+1  —  u*\\q  >  C, 

where  C  stands  for  their  right-hand  sides.  To  show  (4.1),  it  is  sufficient  to  establish 

C  >  5\\uk+1  -  u*\\2G.  (4.2) 

The  challenge  is  that  \\uk+1  —  u*\\G  is  the  sum  of  ||;rfc+1  —  x*\\2p,  || yk+1  —  y*\\q,  and 
^||Afc+1  —  A* || 2,  but  C  does  not  contain  terms  like  \\yk+1  —  y*\\2  and  ||Afc+1  —  A* ||2. 
Therefore,  we  shall  bound  ||Afc+1  —  A*||2  and  \\yk+l  —  y*\\2Q  from  the  existing  terms 
in  C  or  using  the  strong  convexity  assumptions.  This  is  done  in  a  series  of  lemmas 
below. 

Lemma  4.1  (For  scenario  1,  cases  3  and  4,  and  scenario  3).  Suppose  that  B  has  full 
column  rank.  For  any  /i\  >  0,  we  have 

|| yk+1  -  y*\\2  <  Cl\\xk+1  -  x*||2  +  c2||Afc  -  Afc+1||2,  (4.3) 

where  a  :=  (1  +  ^)||A||2  •  X^in(BTB)  >  0  and  c2  :=  (1  +  Pi)(h)~2  '  K^n(BT B)  >  °- 

Proof.  By  (3.24),  we  have  \\B(yk+1  —  y*)\\2  =  ||A(a:A:+1  —  x*)  —  ^(Afe  —  Afe+1)||2.  Then 
apply  the  following  inequality  (or  the  Cauchy- Schwarz  inequality): 

||m  +  n||2  <  Tl  +  ||w||2  +  (1  +  /ii)||u||2,  V/^i  >  0,  (4.4) 


to  its  right-hand  side. 


a 
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Lemma  4.2  (For  scenarios  1  and  2).  Suppose  that  V/  is  Lipschitz  continuous  with 
constant  Lf  and  A  has  full  row  rank.  For  any  /i2  >  1,  we  have 

|| A  -  A* || 2  <  c3||a;fc+1  -  x*||2  +  c4 \\xk  -  xk+1\\2,  (4-5) 

where  c3  :=  L){  1  -  ^)_1A“(n(7L4T)  >  0  and  c4  :=  fi2\\P\\2X^a(AAT)  >  0. 

Proof.  By  the  optimality  conditions  (3.9)  and  (3.20)  together  with  the  Lipschitz  con¬ 
tinuity  of  V/,  we  have 

PT(A  -  A*)  +  P(xk  -  xk+1)\\2  =  || Wf{xk+1)  -  V/(x*) ||2  <  L)\\xk+1  -  x* || 2.  (4.6) 

Then  apply  the  following  basic  inequality: 

|| u  +  u||2  >  ^1  -  ||n||2  +  (1  -  /n2)||u||2,  V/x2  >  0,  (4.7) 

to  the  left  hand  side  of  (4.6).  We  require  /r2  >  1  so  that  (1  —  -A)  >  0.  □ 

Lemma  4.3  (For  scenarios  3  and  4).  Suppose  V/  and  Vg  are  Lipschitz  continuous, 
and  the  initial  multiplier  A0  is  in  the  range  space  of  [A,  B]  (letting  A0  =  0  suffices). 
For  any  p3  >  1  and  /i4  >  0,  we  have 

||A  —  A*||2  <  c5\\xk  —  xk+1\\2  +  C6\\yk  —  yk+1\\q  +  c7\\xk+1  “^*||2  +  c8||7/fe+1  —  y*||2,  (4.8) 

where  c5  =  /i3(l  +  -A)\\[PT , -f3AT B]\\2c  >  0,  c6  =  /i3(l  + /i4)||Q||2c  >  0,  c7  = 
(1  —  ^-)_1L2c  >  0,  c8  =  (1  —  ^-)_1L2c  >  0,  and  c  >  0  is  as  follows: 

•  If  the  matrix  [A,  B]  has  full  row  rank,  c  :=  A“(n([AL,  B][A,  B]T)  >  0. 


•  Otherwise,  rank([A,  B ])  =  r  <  p.  Without  loss  of  generality,  assuming  the  first 
r  rows  of  [A,  B]  (denoted  by  [Ar,Br])  are  linearly  independent,  we  have 


[A,  B] 


I 

L 


[Ari  Br\, 
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where  I  6  Mrxr  is  the  identity  matrix  and  L  6  r)xr.  Let  E  (/  + 
LTL)[Ar,5r];  andc:=  A-Jn(EET)||/ +  LTL||  >0. 


Proof.  We  first  show  that 


|| A  —  A*| 2  <  c- 

Bt 

(A -A*) 

(4.9) 


where  c  >  0  is  defined  above.  If  [A,  B]  has  full  row  rank,  it  is  trivial.  Now,  suppose 
[Al,L?]  is  rank  deficient,  i.e.,  rank  ([.A,  B])  =  r  <  p.  Without  loss  of  generality,  we 
assume  the  first  r  rows  of  [A,  B]  (denoted  by  [AriBr])  are  linearly  independent,  and 
thus 


[A,B] 


I 

L 


[Ari  Br\. 


By  the  update  formula,  if  the  initial  multiplier  A0  is  in  the  range  space  of  [A,  B ],  then 
Xk,  k  =  1,2,...,  always  stay  in  the  range  space  of  [A,  B ],  so  do  A  and  A*.  It  follows 


that 


and  thus 


I 

I 

I 

L 

K,  a  = 

L 

Ar,  A*  = 

L 

AT 

Af 

Bt 

(A  -  A*)  = 

BTr_ 

(J  +  LTL)(Ar 


Since  E  :=  (I+LT L)[Ar,  Br\  has  full  row  rank,  we  have  c  :=  \f^in(EET)\\I+LTL\\  >  0 
and  (4.9)  follows  immediately. 

Combining  the  optimality  conditions  (3.10),  (3.9),  (3.20),  and  (3.21)  together  with 
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the  Lipschitz  continuity  of  V/  and  Vg,  we  have 


P 

(. xk  -  xk+1)  + 

0 

(: vk  -  yk+1) 

(A-A*)  + 

Bt 

—/3BTA 

Q 

=  11  V/(xfc+1)  -  V/(x*) ||2  +  ||V^(/+1)  -  Vc/(y*)||2 
<L2||xfc+1  —  £*||2  +  -hg||?/fe+1  —  dis¬ 
similarly,  we  apply  the  basic  inequalities  (4.4)  and  (4.7)  to  its  left  hand  side  and  use 
(4.9).  □ 

With  the  above  lemmas,  we  now  prove  the  following  main  theorem  of  this  subsec¬ 
tion. 


Theorem  4.1  (Q-linear  convergence  of  \\uk  —  w*||q).  Under  the  same  assumptions  of 
Theorem  3.2  and  7  =  1,  for  all  scenarios  in  Table  f.l,  there  exists  6  >  0  such  that 
(4.1)  holds. 


Proof.  Consider  the  case  of  P  =  0  and  the  corresponding  inequality  (3.37).  In  this 
case  P  =  /3ATA  P  0.  Let  C  denote  the  right-hand  side  of  (3.37). 

Scenarios  1  and  2  (recall  in  both  scenarios,  /  is  strongly  convex,  V/  is  Lipschitz 
continuous,  and  A  has  full  row  rank).  Note  that  C  contains  the  terms  on  the  right 
side  of  (4.5)  with  strictly  positive  coefficients.  Hence,  applying  Lemma  4.2  to  C,  we 
can  obtain 


C  >(cg||x 


k+ 1 


—  X 


+  cw\\y 


k+ 1 


-y 


+  cn||Afc+1  —  A*  || 2) 


(4.11) 


+  (c12\\yk  -  yk+1\\2Q  +  c13||Afc  -  Afc+1||2) 

with  Cg,  Cn  >  0,  cio  =  2 ug  >  0,  c\2  =  rj  >  0,  and  C13  =  rj / (j3j)  >  0.  We  have  eg  >  0 
because  only  a  fraction  of  2uf\\xk+1  —  x*||2  is  used  with  Lemma  4.2;  cg||a;fc+1  —  x*\\2 


is  unused  so  it  stays.  The  same  principle  is  applied  below  to  get  strictly  positive 
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coefficients,  and  we  do  not  re-state  it.  For  proof  brevity,  we  do  not  necessarily  specify 
the  values  of  ct. 

For  scenario  1  with  Q  =  0,  \\uk+1  —  u*\\2G  =  \\xk+1  —  x*\\2p  +  -b||Afc+1  —  A*||2. 
Since  ||a;fe+1  —  P||2  >  Amax(P)_1||a;fc+1  —  x*\\ 2p,  (4.2)  follows  from  (4.11)  with  5  = 
min{c9A“^x(P),  Cn/Py}  >  0. 

For  scenario  1  with  Q  y  0,  \\uk+1—u*\\G  =  \\xk+1  —  x*\\2p+\\yk+1  — x*||q  +  -^;||AA:+1  — 
A*||2.  Since  cio  is  not  necessarily  strictly  positive,  we  shall  apply  Lemma  4.1  to  (4.11) 
and  obtain 

C>(Cl4xk+1-x*\\2  +  Cl4yk+1-y*\\2  +  cu\\Xk+1-X*\\2)  +  c12\\yk-yk+1\\2Q  (4-12) 

where  Cu,  Ci5,  Cn,  c12  >  0.  So,  it  leads  to  (4.1)  with 

6  =  min{c14A-^(JP),ci5A-^(Q),Cii^7}  >  0. 

Scenario  2  (recall  it  is  scenario  1  plus  that  g  is  strongly  convex).  We  have  Ci0  = 
2 vg  >  0  in  (4.11),  which  gives  (4.1)  with  S  =  min{c9A^(P), ci0A“^(<3), cn^}  >  0. 
Note  that  we  have  used  the  convention  that  if  Q  =  0,  then  A“^x(<5)  =  oo. 

Scenario  3  (recall  /  is  strongly  convex,  both  V/  and  Vg  are  Lipschitz  continuous). 
We  apply  Lemma  4.1  to  get  ||yfc+1  —  y*  ||2  with  which  we  then  apply  Lemma  4.3  to 
obtain 

C  >  c16||a;fc+1  -  x*\\2  +  c17\\yk+1  -  y*\\2  +  c18||Afc+1  -  A*||2,  (4.13) 

where  ci6,ci7,ci8  >  0  and  the  terms  ||a:fc  —  xk+1\\2,  \\yk  —  yk+1 1|2,  and  ||Afc  —  Afc+1||2 
with  nonnegative  coefficients  have  been  dropped  from  the  right-hand  side  of  (4.13). 
From  (4.13),  we  obtain  (4.1)  with  S  =  min{ci6A“aX(P),  ci7A“^x((5),  Cis/^y}  >  0. 

Scenario  4  (recall  it  is  scenario  3  plus  that  g  is  strongly  convex).  Since  cn  = 
2 vg  >  0  in  (4.11),  we  can  directly  apply  Lemma  4.3  to  get  (4.1)  with  5  >  0  in  a  way 


similar  to  scenario  3. 
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Now  consider  the  case  of  P  7^  0  and  the  corresponding  inequality  (3.32).  Inequal¬ 
ities  (3.32)  and  (3.37)  are  similar  except  (3.37)  has  the  extra  term  ||xfc  —  xk+1\\2  with 
a  strictly  positive  coefficient  in  its  right-hand  side.  This  term  is  needed  when  Lemma 
4.2  is  applied.  However,  the  assumptions  of  the  theorem  ensure  P  >-  0  whenever 
P  ^  0.  Therefore,  in  (3.32),  the  term  \\uk  —  uk+l\\2G ,  which  contains  ||xfc  —  xfc+1||^, 
can  spare  out  a  term  cig||xfc  —  xfc+1||2  with  cqg  >  0.  Therefore,  following  the  same 
arguments  for  the  case  of  P  =  0,  we  get  (4.1)  with  certain  6  >  0.  □ 

Now  we  extend  the  result  in  Theorem  4.1  (which  is  under  7  =  1)  to  7  7^  1  in  the 
following  theorem. 

Theorem  4.2.  Under  the  same  assumptions  of  Theorem  3.2  and  7  /  1,  for  all 
scenarios  in  Table  f.l, 

i)  if  P  ^  0,  there  exists  5  >  0  such  that  (4.1)  holds; 

ii)  if  P  =  0,  there  exists  5  >  0  such  that 

|| uk  -  u*\\G  +  ^\\rk\\2  >  (1  +  8)  ^||«fc+1  -  u*\\G  +  ^||rfc+1||2^  .  (4.14) 

Proof.  When  7  7^  1,  which  causes  Afc+1  7^  A.  We  shall  bound  ||Afc+1  —  A*  || 2  but  Lemmas 
4.2  and  4.3  only  give  bounds  on  ||  A  —  A* ||2.  Noticing  that  (A  —  A*)  —  (Afc+1  —  A*)  =  A  — 
yfc+i  _  ('y_i)rfe+1  and  C  contains  a  strictly  positive  term  in  ||  Afc  —  Afc+1||2  =  72||rfc+1||2, 
we  can  bound  ||Afc+1  —  A* || 2  by  a  positively  weighted  sum  of  ||  A  —  A*  || 2  and  || \k  —  Afc+1||2. 

If  P  7^  0,  the  rest  of  the  proof  follows  from  that  of  Theorem  4.1. 

If  P  =  0,  7  7^  1  leads  to  (3.36),  which  extends  \\ul  —  u*\\q  in  (3.37)  to  \\ul  —  u*\\^  + 
^||r*||2,  for  i  —  k,  k  +  1.  Since  C  contains  ||Afc  —  Afc+1||2  =  72||rfc+1||2  with  a  strictly 
positive  coefficient,  one  obtains  (4.14)  by  using  this  term  and  following  the  proof  of 


Theorem  4.1. 


B 
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Q-linear  convergent  quantities 

From  the  definition  of  G,  which  depends  on  P  and  Q,  it  is  easy  to  see  that  the  Q- 
linear  convergence  of  uk  =  (xk\  yk\  \k)  translates  to  the  Q-linear  convergence  results 
in  Table  4.2.  For  example,  in  case  1  (P  =  0  and  Q  =  0),  ||wfc+1  —  u*\\q  =  ||a:fc+1  — 
x*\\2p  +  ^||Afc+1  —  A* || 2,  where  P  =  P  +  f3AAT  =  (3ATA.  Hence,  ( Axk ,  Xk)  converges 
Q-linearly.  Examining  ||iifc+1  —  u*\\q  gives  the  results  for  cases  2,  3,  4. 

R-linear  convergent  quantities 

By  the  definition  of  R-linear  convergence,  any  part  of  a  Q-linear  convergent  quantity 
converges  R- linearly.  For  example,  in  case  1  (P  =  0  and  Q  —  0),  the  Q-lincar 
convergence  of  ( Axk ,  Xk)  in  Table  4.2  gives  the  R-linear  convergence  of  Axk  and  Xk. 
Therefore,  to  establish  Table  4.2,  it  remains  to  show  the  R-linear  convergence  of  xk  in 
cases  1  and  3  and  that  of  yk  in  cases  1  and  2.  Our  approach  is  to  bound  their  errors 
by  existing  R-linear  convergent  quantities. 

Theorem  4.3  (R-linear  convergence).  The  following  statements  hold. 

i)  In  cases  1  and  3,  if  Xk  converges  R-linearly,  then  xk  converges  R-linearly. 

ii)  In  cases  1  and  2,  scenario  1,  if  Xk  and  xk  both  converge  R-linearly,  then  Byk 
converges  R-linearly.  In  addition,  if  B  has  full  column  rank,  then  yk  converges 
R-linearly. 

Hi)  In  cases  1  and  2,  scenarios  2-f,  if  Xk  and  xk  both  converge  R-linearly,  then  yk 
converges  R-linearly. 

Proof.  We  only  show  the  result  for  7  =  1  (thus  A  =  Afc+1);  for  7  7^  1  (thus  A  7^  Afc+1), 
the  results  follow  from  those  for  7  =  1  and  the  R-lincar  convergence  of  ||  A  —  Afc+1||2, 


51 


which  itself  follows  from  (3.16)  and  the  R-linear  convergence  of  Xk  (thus  that  of 
Xk  _  Afc+i)_ 

i)  By  (3.22)  and  P  =  f3ATA ,  we  have  Uf\\xk+1  —  x*\\2  <  \\A\\ ||xfc+1  —  x*||  || Afc+1  —  A*||, 
which  implies 

|| xk+1  -  ;r*||2  <  M^||Afc+1  -  A*||2.  (4.15) 

vf 

ii)  The  result  follows  from  (3.24). 

iii)  Scenario  3  assumes  the  full  column  rank  of  B ,  so  the  result  follows  from  (3.24). 
In  scenarios  2  and  4,  g  is  strongly  convex.  Recall  (3.23)  with  A  =  Afc+1: 

(yk+1  -  y*,  BT  (Xk+1  -  X*  -  /3A{xk  -  xfc+1))  +  Q(yk  -  yk+l ))  >  vg\\ yk+1  -  y* ||2. 

(4.16) 

By  the  Cauchy-Schwarz  inequality  and  0  =  0.  we  have 

ug\\yk+1  -y*\\  <  \\B\\\\Xk+1  -  X*  -  (3A(xk  -  xk+1)\\.  (4.17) 

Therefore,  the  result  follows  from  the  R-linear  convergence  of  xk  and  Xk . 

□ 


4.2.2  Explicit  Formula  of  Linear  Rate 

To  keep  the  proof  of  Theorem  4.1  easy  to  follow,  we  have  avoided  giving  the  explicit 
formulas  of  q’s  and  thus  also  those  of  5.  To  give  the  reader  an  idea  what  quantities 
affect  6,  we  now  provide  an  explicit  formula  of  6  for  the  classic  AD  MM  (i.e. ,  case  1 
with  7  =  1)  under  scenario  1. 
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Corollary  4.1  (Convergence  rate  of  classic  ADMM  under  scenario  1).  Under  As¬ 
sumptions  3.1  and  3.2,  for  scenario  1  in  Table  f.l,  the  sequence  {wfe}  of  Algorithm  1 
satisfies  (4.1)  with 


6  =  2 


rm  IP 


+ 


-l 


\  Vf  /3Amin(/L4T)  / 

In  particular,  choosing  (3  =  a\\2k^1(AAt)  yields  the  largest  6: 

1 


(4.18) 


^max  , - j 

KA^fKf 


(4.19) 


where  ka  '■=  >/ Amax ( AAT ) / Amin (AAT)  is  the  condition  number  of  matrix  A,  and 
Kf  =  Lf/uf  is  the  condition  number  of  function  f . 


Proof.  Recall  the  important  inequality  (3.37)  in  Theorem  3.1: 


*  1 1 2 


k+1  ,*112 


\U  Vb  ||q  | Vu  ||^f 


>2uf\\xk+1  -  :r*||2  +  2ug\\yk+1  -  y*  ||2  +  \\uk  -  uk+1\\2G  +  2uf\\xk  -  xk+1\\2. 
Note  that  the  term  Uf\\xk+1  —  x*||2  on  the  right-hand  side  comes  from  (3.22): 


(4.20) 


(xk+1  -  x *,  AT{ \k+1  -  A*))  >  isf\\xk+L  -  x* ||2, 


fc+l  *||  2 


due  to  the  strong  convexity  of  /  and  the  optimality  conditions: 


(4.21) 


AT\t+1  =  Vf(xk+1),  ATX‘  =  Vf(x’). 


On  the  other  hand,  since  V/  is  Lipschitz  continuous  and  A  has  full  row  rank,  using 
(2.34)  yields 

( xk+l-x *,  AT(Afc+1-A*)>  >  -^||AT(Afc+1-A*)||2  >  Amin(^r)  || Afc+1  —  A* ||2.  (4.22) 

Lf  Lf 

By  combining  (4.21)  and  (4.22),  it  follows  that  for  any  t  E  [0, 1], 

(xk+1—x*,  AT(\k+1-\*))  >  t-uf\\xk+1—x*\\2+(l—t)  Amin^^T)  || Afc+1  —  A* ||2.  (4.23) 

Lf 
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Now,  using  (4.23)  to  replace  (4.21)  in  our  analysis  in  Section  3.3,  the  inequality  (4.20) 
can  be  further  refined  as 


\uk  ~  u*fG  -  \\uk+1  -  u*\\2g 


>  2 1  •  vf\\xk+1  -  x*\\2  +  2(1  -  t)W^T)||Afc+1  _  a*||2 

Lf 


+  2ug\\ yk+1  -  y*  ||2  +  ||  uk  -  uk+l\\2G  +  2vf\\xk  -  xk+1\\2,  Vt  e  [0, 1], 
In  particular,  letting 

t  —  I  1  + 


LfvI 


/32||A||2Amin(AL4r) 


-i 


we  have 


(4.24) 


(4.25) 


2 1  •  uj\\xk+1  -  x* || 2  +  2(1  -  t)Xrain[AAT\\\k+1  -  A* || 2 

Lf 

>5  M||A||2||a;fc+1  -  x* || 2  +  ^||Afc+1  -  A*||2^  >  5\\uk+1  -  u*\\2G,  (4.26) 

where  <5  >  0  is  given  by  (4.18).  Then  (4.1)  follows  from  (4.24)  and  (4.26)  immediately. 


□ 


Not  surprisingly,  the  convergence  rate  under  scenario  1  is  negatively  affected  by 
the  condition  numbers  of  A  and  /.  For  other  scenarios,  the  formulas  of  <5  can  also  be 
similarly  obtained  by  deriving  the  specific  values  of  Cj’s  in  our  analysis.  However,  they 
appear  to  be  more  complicated  than  the  nice  formula  (4.19)  for  scenario  1.  A  close 
look  at  these  formulas  of  Cj’s  reveals  that  the  convergence  rate  is  negatively  affected 
by  the  condition  numbers  of  the  constraint  matrices  A,  B  and  [A,  B],  as  well  as  the 
condition  numbers  of  the  objective  functions  /  and  g.  Due  to  page  limit,  we  leave 
other  scenarios/cases  and  further  analysis  to  future  research. 

Remark  4.2.  It  is  well  known  that  the  penalty  parameter  f3  can  significantly  affect  the 
speed  of  ADMM.  Since  the  rate  of  convergence  developed  in  this  section  is  a  function 
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of  (3,  the  rate  can  be  optimized  over  (3  which  sheds  some  lights  on  how  to  choose  a 
“good”  j3.  More  analysis  and  numerical  simulations  are  left  as  future  research. 

4.2.3  Comparison  with  Linear  Rate  of  DRSM 

It  is  known  that  applying  the  classic  ADMM  to  problem  (2.6)  is  equivalent  to  applying 
the  Douglas- Rachford  splitting  method  (DRSM)  to  the  dual  of  (2.6).  (However,  it 
is  unclear  to  which  splitting  methods  the  various  ADMM  generalizations  correspond 
to.)  In  this  subsection,  we  review  the  classic  linear  convergence  result  [40]  of  DRSM. 
In  comparison,  we  show  that  our  linear  rate  for  the  classic  ADMM  is  considerably 
better  than  the  one  in  [40]. 

The  dual  of  (2.6)  is  given  by 

min  <  —  min  f(x)  +  g{y)  —  AT(Ax  +  By  —  b)  1  =  min  f*(AT A)  +  g*(BT A)  —  bT A, 

A  [  x,y  J  A 

(4.27) 

where  /*  and  g*  are  the  convex  conjugate  functions  of  /  and  g ,  respectively.  Define 
the  maximal  monotone  operators  A  and  B  as  follows: 

A(X)  :=  d[g*(BT A)]  -  b ,  B( A)  :=  d[f*(AT A)].  (4.28) 

Then  (4.27)  is  equivalent  to  finding  a  zero  of  the  sum  of  two  maximal  monotone 
operators: 

0gM(A)  +  H(A).  (4.29) 

Applying  DRSM  to  the  above  problem  yields  the  following  algorithm: 

vk+1  =  J_5( 2  -  I)vk  +  (/  -  4)v\ 

\k+1  =  J%vk+1, 


(4.30) 

(4.31) 
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where  =  (/  +  /L4)_1  and  Jg  —  (/  +  /3£>)_1  are  the  resolvent  operators.  After  some 
calculation,  it  can  be  shown  that  this  algorithm  is  equivalent  to  the  classic  ADMM 
(Algorithm  1)  [20].  Here,  the  variable  v  corresponds  to 

vk  =  @Axk  +  Xk.  (4.32) 

The  linear  convergence  of  DRSM  was  established  by  Lions  and  Merciers  [40] .  We 
summarize  their  result  in  the  following  theorem. 

Theorem  4.4  (Lions  and  Mercier  [40]).  Assume  the  operator  B  is  both  coercive  and 
Lipschitz.  Namely,  there  exists  a  >  0  and  M  >  0  such  that 

(B( AO  -  B( A2),  Ax  -  A2>  >  a||A!  -  A2||2,  (4.33) 

||8(Ai)-8(A2)||  <M||A1-A2||.  (4.34) 

Then,  there  exists  a  constant  C  >  0  such  that 

II Afc  -  A*||2  <  C  •  9k ,  ||vfc+1  -  n*||2  <  9  ■  \\vk  -  u*||2,  (4.35) 


where 


The  smallest  9  is  given  by 


9  =  1- 


2  /3a 

(1  +  /3M)2' 


which  corresponds  to  fd  =  1/M. 


(4.36) 


(4.37) 


Under  the  assumptions  of  Scenario  1  of  Table  4.1,  /  is  strongly  convex  with 
constant  uj,  gradient  V/  is  Lipschitz  with  constant  Lf,  and  matrix  A  has  full  row 
rank.  Then  the  operator  B  :=  d[f*  odT]  is  coercive  and  Lipschitz  with  the  constants 


a  =  \min(AAr)/Lf ,  M=\\Af/uf. 


(4.38) 
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Hence,  the  linear  convergence  of  AD  MM  follows  from  Theorem  4.4,  and  the  optimal 
linear  rate  is  given  by 


Xm^(AAT)^f  _  _  1 

2PII2£/  “  2k>/ 


In  contrast,  our  linear  rate  (4.19)  is  given  by 


(4.39) 


1 

1  T  Cmax 


1  -  <5max  +  0(6. ^ax) 


(4.40) 


which  is  better  than  (4.39).  By  careful  inspection,  it  is  also  clear  that  our  linear  rate 
(4.18)  considerably  improves  the  classic  rate  (4.36)  in  [40]. 


4.3  Applications 

This  section  describes  several  well-known  optimization  models  on  which  Algorithm  2 
not  only  enjoys  global  linear  convergence  but  also  often  has  easy-to-solve  subproblems. 

4.3.1  Convex  Regularization 

Consider  the  convex  regularization  models  discussed  in  Section  2.3: 

min  f(Ax  —  b)  +  g(x),  (4-41) 

X 

where  /  is  the  loss  function  and  g  is  the  regularization  function.  In  many  applications, 
the  loss  function  /  is  typically  a  strongly  convex  function  with  Lipschitz  continuous 
gradient,  such  as  the  commonly  used  squared  t2-norm  ||  •  \\\  for  least  squares.  If 
matrix  A  has  full  column  rank,  then  the  term  f(Ax  —  b )  is  strongly  convex  in  x  and 
the  following  reformulation  of  the  problem  (4.41): 

min  f(Ax  —  b)  +  g(y) 

x,y 


s.t.  x  —  y  —  0, 


(4.42) 
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satisfies  the  scenario  1  in  Table  4.1. 

If  matrix  A  is  column-rank  deficient,  then  f(Ax  —  b)  is  not  strongly  convex  in 
x.  Alternatively,  we  may  consider  the  following  equivalent  problem  after  introducing 
y  =  Ax  —  b: 


min  g(x)  +  f(y ) 

x,y 

s.t.  Ax  —  y  =  b. 


(4.43) 


Then  it  satisfies  the  scenario  1  in  Table  4.1  if  the  x-subproblem  can  be  solved  exactly 
with  P  =  0  (since  full  column  rank  of  A  is  needed  if  P  y  0).  Moreover,  if  the 
regularization  function  g  is  also  strongly  convex,  then  the  problem  will  satisfy  the 
scenorio  2  or  4,  depending  on  whether  g  has  Lipschitz  continuous  gradient.  As  one 
of  the  most  commonly  used  regularization  methods,  the  Tikhonov  regularization ,  also 
known  as  ridge  regression  in  statistics: 


min  \\Ax  —  6||2  +  A||a:||2  (A  >  0)  (4.44) 

X 

is  an  example  of  scenario  4,  that  both  /  and  g  are  strongly  convex  and  have  Lipschitz 
continuous  gradients. 


4.3.2  Sparse  Optimization 
Elastic  net  (augmented  i\)  model: 

min  ||x||i  T  cc  1 1  x  1 1 2  T  1 1 Ax  —  6 1 1 2 ,  (4.45) 

x  2y 

where  A  6  Wnxn,  a  >  0  and  y  >  0  are  parameters.  It  has  been  shown  that  the 
elastic  model  can  effectively  recover  sparse  vectors  and  outperform  Lasso  (a  =  0)  on 
reported  real- world  regression  problems  [63].  With  the  constraint  x  —  y,  (4.45)  can 
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be  reformulated  as: 


min  ||?/||i  +  cc||a;||2  +  —  \\Ax  —  b ||" 

x,y  2/i 


(4.46) 


s.t.  x  —  y  —  0. 

Similarly,  the  elastic  net  model  can  be  extended  for  recovering  low-rank  matrices. 

Augmented  nuclear-norm  model: 


min  11X11*  +  a\\X\\2F  +  ^\\AX)  ~  b\\ 2, 


(4.47) 


where  a  >  0  and  /x  >  0  are  parameters,  A  :  Wnxn  — y  Mp  is  a  linear  operator,  ||  •  ||_f 
denotes  the  Frobenius  norm,  and  ||  •  ||*  denotes  the  nuclear  norm.  By  variable  splitting 
X  =  Y,  (4.47)  can  be  reformulated  as: 


mm\\Y\U  +  a\\X\\l  +  -\\A(X)-b\\2 
s.t.  X  -  Y  =  0. 


(4.48) 


In  (4.46)  and  (4.48),  both  of  the  functions  f(x)  =  cr||cc|| 2  +  ^|| Ax  —  b\\2  and 
f(X)  =  a || X || p  +  2^|| A{X)  —  b || 2  are  strongly  convex  and  have  Lipschitz  continuous 
gradients.  Therefore,  they  satisfy  the  scenario  1  of  Table  4.1. 


4.3.3  Consensus  and  Sharing  Optimization 

As  discussed  in  Section  2.3,  we  consider  the  global  consensus  problem  with  regular¬ 
ization: 

N 


(4.49) 


min 

{*i},y 

i— 1 

)  +  9(y) 

X\ 

-i 

s.t. 

XN 

" 

-i 

y  = 
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and  the  sharing  problem: 

N  /  N  \ 

mm  ^2fi(xi)+g  l^Vi  ,  s.t.  xt  -  yi  =  0,  i  =  1, . . . ,  N,  (4.50) 
{zibtei  i=1  V*=i  / 

where  /j’s  are  local  cost  functions  and  g  is  the  regularization  or  shared  cost  function. 

If  each  function  /,  is  strongly  convex  and  has  Lipschitz  continuous  gradient,  then 
both  problems  (4.49)  and  (4.50)  satisfy  the  conditions  in  scenario  1  of  Table  4.1.  If  the 
function  g  is  strongly  convex  and  both  /  and  g  have  Lipschitz  continuous  gradients, 
then  problem  (4.49)  satisfies  scenario  3  or  4,  depending  on  whether  each  function  /) 
is  strongly  convex. 

4.4  Numerical  Demonstration 

We  present  some  simple  numerical  results  to  demonstrate  the  linear  convergence  of 
Algorithm  2  on  the  elastic  net  and  distributed  Lasso  problems. 

4.4.1  Elastic  Net 

We  apply  Algorithm  2  with  P  =  0  and  Q  =  0  to  a  small  elastic  net  problem  (4.46), 
where  the  feature  matrix  A  has  m  =  250  examples  and  n  =  1000  features.  We  first 
generated  the  matrix  A  from  the  standard  Gaussian  distribution  A/"(0, 1)  and  then 
orthonormalized  its  rows.  A  sparse  vector  x°  G  M”  was  generated  with  25  nonzero 
entries,  each  sampled  from  the  standard  Gaussian  distribution.  The  observation 
vector  b  G  was  then  computed  by  b  =  Ax°  +  e,  where  e  ~  A/"(0, 10-3/).  We  chose 
the  model  parameters  a  =  0.1  and  /i  =  10~2,  which  we  found  to  yield  reasonable 
accuracy  for  recovering  the  sparse  solution.  We  initialized  all  the  variables  at  zero 
and  set  the  algorithm  parameters  (3  =  100  and  7  =  1.  We  ran  the  algorithm  for  200 
iterations  and  recorded  the  errors  at  each  iteration  with  respect  to  a  precomputed 
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reference  solution  u* . 

Figure  4.1(a)  shows  the  decreasing  behavior  of  \\uk  —  u*\\2G{\  =  (3\\xk  — x*||2  +  ||Afc  — 
X*\\2//3)  as  the  algorithm  progresses.  Since  variable  y  is  not  contained  in  the  G-norm, 
we  also  plot  the  convergence  curve  of  \\yk  —  y* ||2  in  Figure  4.1(b).  We  observe  that 
both  uk  and  yk  converge  at  similar  linear  rates.  In  addition,  the  convergence  appears 
to  have  different  stages.  The  later  stage  exhibits  faster  convergence  rate  than  the 
earlier  stage.  This  can  be  clearly  seen  in  Figure  4.2  which  depicts  the  Q-linear  rate 
|| uk+l  -  u*\\2G/\\uk  -  u*\\2g. 


(a)  \\uk  —  u*\\q  vs.  iteration  (b)  \\yk  —  y*\\2  vs.  iteration 

Figure  4.1  :  Elastic  net:  convergence  curves  of  ADMM. 

Here,  the  strong  convexity  constant  of  /  is  Uf  =  2a  +  X min(HTH)//r  =  2a  and 
the  Lipschitz  constant  of  V/  is  Lf  =  2a  +  X max(ATA)/y  =  2 a  +  l//n  By  (4.18), 
our  bound  for  the  global  linear  rate  is  (1  +  <5)_1  =  0.996,  which  roughly  matches 
the  early-stage  rate  shown  in  the  figure.  However,  our  theoretical  bound  is  rather 
conservative,  since  it  is  a  global  worst-case  bound  and  it  does  not  take  into  account 
the  properties  of  the  l\  norm  and  the  solution.  In  fact,  the  optimal  solution  x*  is 
very  sparse  and  xk  will  also  become  sparse  after  a  number  of  iterations.  Let  S  be  an 
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Convergence  rate  of  uk 


Figure  4.2  :  Elastic  net:  Q-linear  convergence  rate  of  ADMM. 


index  set  of  the  nonzero  support  of  ( xk  —  x*),  and  be  a  submatrix  composed  of 
those  columns  of  A  indexed  by  S.  Then,  the  constants  Uf  and  Lf  in  our  bound  can 
be  effectively  replaced  by  Vf  —  2a  +  Xmin(AgAs)  /  /x  and  Lf  =  2a  +  XmajX(AgAs)//x, 
thereby  accounting  for  the  faster  convergence  rate  in  the  later  stage.  For  example, 
letting  S  be  the  nonzero  support  of  the  optimal  solution  x* ,  we  obtain  an  estimate 
of  the  (asymptotic)  linear  rate  (1  +  5)”1  =  0.817,  which  well  matches  the  later-stage 
rate. 


4.4.2  Distributed  Lasso 


We  consider  solving  the  Lasso  problem  in  a  distributed  way  [41]: 


N 


1 


i=l 


mm 


(4.51) 


s.t.  Xi  -  y  =  0,  %  =  1,  ...,1V, 

which  is  an  instance  of  the  global  consensus  problem  with  regularization  (2.21). 
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We  apply  Algorithm  2  with  P  =  0  and  Q  =  0  to  a  small  distributed  Lasso  problem 
(4.51)  with  N  =  5,  where  each  A,  has  m  =  600  examples  and  n  =  500  features.  Each 
Ai  is  a  tall  matrix  and  has  full  column  rank,  yielding  a  strongly  convex  objective 
function  in  X{.  Therefore,  Algorithm  2  is  guaranteed  to  converge  linearly. 

We  generated  the  data  similarly  as  in  the  elastic  net  test.  We  randomly  generated 
each  Ai  from  the  standard  Gaussian  distribution  A7(0,l),  and  then  simply  scaled 
its  columns  to  have  a  unit  length.  We  generated  a  sparse  vector  x°  £  Rn  with  250 
nonzero  entries,  each  sampled  from  the  A/"(0, 1)  distribution.  Each  bi  €  Mm  was  then 
computed  by  bi  =  AiX°  +  6j,  where  e*  ~  A/"(0, 10~3/).  We  chose  the  model  parameter 
H  =  0.1,  which  we  found  to  yield  reasonably  good  recovery  quality.  From  the  initial 
point  at  zero,  we  ran  the  algorithm  with  parameters  {3  —  10  and  7  =  1  for  50  iterations 
and  computed  the  iterative  errors. 

Figure  4.3  demonstrates  the  clear  linear  convergence  behavior  of  ||wfc  —  u* ||^  and 
|| yk  —  y* ||2.  In  Figure  4.4,  the  Q-linear  convergence  rate  of  \\uk  —  u*\\q  is  depicted. 
For  this  problem,  the  strong  convexity  constant  is  Vf  =  minj{Amin(Af  Ai)/^}  and  the 
Lipschitz  constant  is  Lf  —  maXj{Amax(Af  Ai)/fi}.  However,  the  condition  number 
Vf/Lf  in  this  test  is  relatively  big,  and  hence  the  theoretical  linear  rate  specified  by 
(4.19)  is  not  a  very  tight  bound  for  the  observed  fast  rate.  Note  that  all  Xi  s  tend 
to  be  equal  and  become  sparse  after  a  number  of  iterations.  Similar  to  our  previous 
discussion  in  Section  4.4.1,  we  can  estimate  the  asymptotic  linear  rate  by  letting 
Vf  =  \min(AgAs) / (/iN)  and  Lf  =  \max(AgAs) / (/j,N) ,  where  A  e  MJVmXTl  is  formed 
by  stacking  all  the  matrices  Ai  (i  —  1 , ,N),  and  S  is  an  index  set  of  the  nonzero 
support  of  x*.  We  obtained  the  asymptotic  linear  rate  to  be  (1  +  h)^1  =  0.779,  which 
appears  to  be  a  much  tighter  bound. 
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(a)  \\uk  —  u*\\q  vs.  iteration  (b)  \\yk  —  y*\\2  vs.  iteration 

Figure  4.3  :  Distributed  Lasso:  convergence  curves  of  ADMM. 


Convergence  rate  of  uk 


Figure  4.4  :  Distributed  Lasso:  Q-linear  convergence  rate  of  ADMM. 
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Chapter  5 

Parallel  Multi-Block  ADMM 

This  chapter  introduces  a  parallel  and  multi-block  extension  to  the  alternating  direc¬ 
tion  method  of  multipliers  (ADMM)  for  solving  convex  problem: 

minimize  /i(x,)  4 - f-  /jv(xjv) 

subject  to  A]Xi  +  •  •  •  +  AjvXtv  =  c, 
xi  G  X\ ,  . . . ,  x^v  G  Aat. 

The  proposed  algorithm  decomposes  the  original  problem  into  N  smaller  subproblems 
and  solves  them  in  parallel  at  each  iteration.  It  is  well  suited  to  distributed  computing 
and  is  particularly  attractive  for  solving  certain  large-scale  problems. 

This  chapter  is  organized  as  follows.  In  Section  5.1,  we  briefly  review  some  existing 
parallel  and  distributed  algorithms  for  solving  the  above  optimization  problem.  Then 
we  introduce  a  few  novel  results  in  the  following  sections.  In  Section  5.2,  we  show 
that  extending  ADMM  straightforwardly  from  the  classic  Gauss-Seidcl  setting  to  the 
Jacobi  setting,  from  2  blocks  to  N  blocks,  will  preserve  convergence  if  matrices  A* 
are  mutually  near-orthogonal  and  have  full  column-rank.  In  Section  5.3,  for  general 
matrices  Aj,  we  propose  to  add  proximal  terms  of  different  kinds  to  the  N  subproblems 
so  that  the  subproblems  can  be  solved  in  flexible  and  efficient  ways  and  the  algorithm 
converges  globally  at  a  rate  of  o{l/k).  We  also  introduce  a  strategy  for  dynamically 
tuning  the  parameters  in  the  algorithm,  often  leading  to  substantial  acceleration  of  the 
convergence  in  practice.  In  Section  5.4,  numerical  results  are  presented  to  demonstrate 
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the  efficiency  of  the  proposed  algorithm  in  comparison  with  several  existing  parallel 
algorithms.  We  also  implemented  onr  algorithm  on  Amazon  EC2,  an  on-demand 
public  computing  cloud,  and  report  its  performance  on  very  large-scale  basis  pursuit 
problems  with  distributed  data. 


5.1  Introduction 


We  consider  the  following  convex  optimization  problem  with  N  (N  >  2)  blocks  of 
variables: 


N 


N 


mm 

Xl,X2,....X]V 


5>(x0  s.t.  djXj  =  c,  (5.1) 

i=l  i= 1 

where  Xj  G  A*  G  Mmxni,  c  G  Mm,  and  /*  :  Mni  — >  (— oo,  +oo]  are  closed  proper 
convex  functions,  i  =  1,2,...,  AT.  If  an  individual  block  is  subject  to  constraint 
Xj  G  Aj,  where  A)  C  is  a  nonempty  closed  convex  set,  it  can  be  incorporated  in 
the  objective  function  ft  using  the  indicator  function: 


I*(xi)  = 


0  if  Xj  G  Aj, 


(5.2) 


+cx)  otherwise. 

In  this  study,  we  do  not  impose  any  additional  assumptions  such  as  strict  convexity 
or  differentiability  on  the  objective  functions  /*. 

The  problem  (5.1)  is  also  referred  to  as  an  extended  monotropic  programming 
problem  [2].  The  special  case  that  each  x,-  is  a  scalar  (i.e. ,  n*  =  1)  is  called  a 
monotropic  programming  problem  [51].  Such  optimization  problems  arise  from  a 
broad  spectrum  of  applications  including  numerical  partial  differential  equations, 
signal  and  image  processing,  compressive  sensing,  statistics  and  machine  learning. 
See  [1,4,8,22,43,46,47,55,60]  and  the  references  therein  for  a  number  of  examples. 

We  focus  on  parallel  and  distributed  optimization  algorithms  for  solving  the  prob¬ 
lem  (5.1).  Due  to  the  dramatically  increasing  demand  for  dealing  with  big  data, 


66 


parallel  and  distributed  computational  methods  are  highly  desirable.  Since  both  of 
the  objective  function  and  constraints  of  (5.1)  are  summations  of  terms  on  individ¬ 
ual  Xj’s  (we  call  them  separable) ,  the  problem  can  be  decomposed  into  N  smaller 
subproblems,  which  can  be  solved  in  a  parallel  and  distributed  manner. 


5.1.1  Literature  review 


A  simple  distributed  algorithm  for  solving  (5.1)  is  dual  decomposition  [19],  which  is 
essentially  a  dual  ascent  method  or  dual  subgradient  method  [53]  as  follows.  Consider 
the  Lagrangian  for  problem  (5.1): 

N  /  N  \ 

£(xi, . . . ,  Xjy,  A)  =  fi(xi)  -  XT  (  ^2  _  c  )  (5-3) 

i= 1  \  i=l  / 

where  A  G  Mm  is  the  Lagrangian  multiplier  or  the  dual  variable.  The  method  of  dual 
decomposition  iterates  as  follows:  for  k  >  1, 


_fc+l  __&+ 1 


xw+1)  =  argminf> 


£(x1,...,xAr,At), 


(5.4) 


[  A‘+1  =  A‘  -  at  (E" ,  4t.xf+1  -  c)  , 

where  o;^.  >  0  is  a  step-size.  Since  all  the  x*’s  are  separable  in  the  Lagrangian  function 
(5.3),  the  x-update  step  reduces  to  solving  N  individual  x,-subproblems: 


xf+1  =  arg  min  /j(xj)  -  (Afc,  A»Xj),  for  i  =  1,  2, . . . ,  N,  (5.5) 

Xi 

and  thus  they  can  be  carried  out  in  parallel.  With  suitable  choice  of  a k  and  certain 
assumptions,  dual  decomposition  is  guaranteed  to  converge  to  an  optimal  solution 
[53].  However,  the  convergence  of  such  subgradient  method  often  tends  to  be  slow  in 
practice.  Its  convergence  rate  for  general  convex  problems  is  0(1/ Vk). 

Another  effective  distributed  approach  is  based  on  the  alternating  direction  method 
of  multipliers  (ADMM).  As  we  know,  AD  MM  was  introduced  to  solve  the  special  case 
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of  problem  (5.1)  with  only  two  blocks  of  variables  (i.e. ,  N  =  2).  To  solve  the  problem 
(5.1)  with  N  >  3,  one  can  first  convert  the  multi-block  problem  into  an  equivalent 
two-block  problem  and  then  solve  it  by  ADMM.  The  problem  reformulation  is  done 
via  variable  splitting  [1]: 

min 
(xi},{zd 

s.t. 


or  equivalently, 

min 

{xibizj 

s.t. 

where  Iz  is  an  indicator  function  defined  by  (5.2),  and  the  convex  set  Z  is  given  by 

Z  =  j(zi,. . .  ,zjv)  :  =  0  j  . 

The  variables  z?:  (we  call  them  splitting  variables )  are  introduced  to  decouple  the  Xj’s 
in  the  constraints.  Now  we  group  the  variables  into  two  blocks:  x  :=  (xi, . . . ,  x^v)  and 
z  :=  (zi, . . . ,  zjv)-  Then  ADMM  can  be  applied  directly.  The  augmented  Lagrangian 
for  (5.7)  is  given  by 

N  N  „  N 

£/3(x,z,A)  =  ^/i(xi)  +  J2(z)-^A7  (A;X,  -  z,  + 

i=l  i=  1  i= 1 

(5.8) 


N 


i=l 


'4‘x‘-z<=  JV’  Vi  =  1’2’ 


,JV, 


N 


=  °» 


i=  1 


(5.6) 


N 


]>>(x,)  +  -Tz(zi,---,zjv) 


i=  1 


Axi  —  z,  —  — ,  Vi  =  l,2,...,JV, 


(5.7) 


Since  all  the  Xj’s  are  now  fully  decoupled,  the  resulting  x-subproblem  decomposes 
into  N  individual  x? -subp r o b lerns : 


I .  _L  1  ■  n  /  \ 

x,;  =  arg  mm  A  (x*)  +  - 


A,xt 


k-\- 1 


C 

N 


A? 

P 


(5.9) 
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which  can  be  carried  out  in  parallel.  The  resulting  z-subproblem  is  a  simple  quadratic 
problem: 


N 


L- Li 

z  =  argmin 

{»=Efel»i=0}  2 


E 


\  k  ^ 

A  c  K 

A,'X;  —  Zj - - 

1  N  p 


(5.10) 


which  admits  a  closed-form  solution: 


(5.11) 


We  summarize  the  algorithm  below: 


Algorithm  3:  Variable  Splitting  ADMM  (VSADMM) 


1  Initialize  x°,  A0,  f3  >  0; 

2  for  k  —  0,  1, ...  do 

Update  Zj  then  x*  for  i  —  1, . . . ,  N  in  parallel  by: 

{ElLiV?-*- 


,k+l  _  (  A  _  c_  __  K_  \  _  1_ 
~  I  N  8  N 


k+1 


=  argmin  /;(x;)  +  §  1 1  Ax*  -  zf+1  -  j,  -  f- 


Update  A*fc+1  =\ki-P  (A 


rk+ 1 


-  z 


fc+1 


This  ADMM  approach  based  on  (5.7),  by  introducing  splitting  variables,  sub¬ 
stantially  increases  the  number  of  variables  and  constraints  in  the  problem,  especially 
when  N  is  large.  Alternatively,  we  aim  to  develop  a  parallel  and  multi-block  extension 
of  ADMM  that  solves  (5.1)  directly  without  splitting  variables. 

A  straightforward  multi-block  extension  of  ADMM  is  to  use  a  sweep  of  Gauss- 
Seidel  update  to  minimize  the  augmented  Lagrangian.  Namely,  it  updates  x*  for 
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i  —  1,  2, . . . ,  N  sequentially  as  follows: 


xf+1  =  arg  min  ,  xf+A  x;,  xf+1, . . . ,  x^,  A 


jfc+i 


fc+i 


,fc  \  fc\ 


=  arg  min  /)(x,; 


/3 


Xi 


S  AxJ+1  +  +  Y1  ajxj  ~c~ 


j<i  j>i 

where  Cp  is  the  augmented  Lagrangian  for  (5.1): 

N  /  N  \ 

£g(x  1,  ■  ■  ■  ,xjy,  A)  =  /i(xj)  -  AT  ^  A;X,:  -  c  )  + 


Afc 

7 


(5.12) 


P 


2=1 


.  2=1 


AT 


X)  ^x?;  - 


2=1 


(5.13) 


Such  Gauss-Seidel  ADMM  (Algorithm  4)  has  been  considered  lately,  e.g.,  in  [34,38]. 
It  has  been  shown  that  the  algorithm  may  not  converge  for  N  >  3  [9] .  Although  lack  of 
convergence  guarantee,  some  empirical  studies  show  that  the  Gauss-Seidel  ADMM  is 
still  very  effective  at  solving  many  practical  problems  (see,  e.g.,  [47,55,58]).  However, 
as  the  Gauss-Seidel  ADMM  updates  the  blocks  sequentially  one  after  another,  it  is 
not  amenable  for  parallelization. 


Algorithm  4:  Gauss-Seidel  ADMM 

1  Initialize  x°,  A0,  j3  >  0; 

2  for  k  —  0,  1, ...  do 

3 

Update  Xj  for  i  —  1, . . . ,  N  sequentially  by: 

x.f+1  =  minx.  fi(xi)  +  § 

E  j<i  Aj^+l  +  Ax,  +  Ejm  Ajxj  ~c~t 

2 

5 

4 

Update  Afc+1  =  Xk  -  p  (£*=i  A^+1  -  c) . 

5.1.2  Jacobi-Type  ADMM 

A  parallel  counterpart  of  the  Gauss-Seidel  ADMM  is  the  Jacobi  ADMM  (see  Algo¬ 
rithm  5  below).  Unlike  the  Gauss-Seidel  ADMM,  the  Jacobi  ADMM  updates  all  the 
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blocks  in  parallel  at  every  iteration: 


x,;+  =  arg  min  £p(x.f,  .  . .  xf+1, . . .  ,xiV,  A  ’) 


•  ft  \  _i_  ^ 

=  arg  mm  /?;(x,;)  +  - 


Afc 

7 


,  Vi  =  1,...,AA  (5.14) 


Therefore,  it  is  well  suited  to  parallel  and  distributed  computation. 


Algorithm  5:  Jacobi  ADMM 

i  Initialize  x°,  A0,  (3  >  0; 

2  for  k  =  0,  1, ...  do 

3 

Update  Xj  for  i  —  1, . . . ,  N  in  parallel  by: 

x- +1  =  arg  minx.  ^(x*)  +  f 

+  Ytj#  Ai*j  ~  c  -  t 

2 

4 

Update  Afc+1  =  A k  —  /3 

i  Ai*i+1  ~  c) ■ 

On  the  other  hand,  such  straightforward  parallelization  comes  with  a  cost:  the 
augmented  Lagrangian  is  minimized  “less  accurately”  than  it  is  in  the  Gauss-Seidel 
scheme.  Hence,  this  Jacobi  ADMM  is  more  likely  to  diverge  than  the  Gauss-Seidel 
ADMM.  In  fact,  it  may  diverge  even  when  N  =  2;  see  [33]  for  such  an  example.  A 
few  variants  of  Jacobi  ADMM  have  been  proposed  which  take  certain  correction  steps 
to  ensure  its  convergence  [32,33]. 

In  order  to  guarantee  the  convergence  of  Jacobi  ADMM,  either  additional  as¬ 
sumptions  or  modifications  to  Algorithm  5  must  be  made.  Therefore,  we  discuss 
along  these  two  lines  in  the  rest  of  the  chapter.  In  Section  5.2,  we  provide  a  sufficient 
condition  for  the  convergence  of  Algorithm  5  by  assuming  the  “near-orthogonality” 
of  the  matrices  Aj.  In  Section  5.3,  we  propose  a  simple  modification  to  Algorithm  5 
-  called  Jacobi-Proximal  ADMM  —  which  converges  at  a  rate  of  o{l/k). 
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5.1.3  Notation  and  Assumptions 


To  simplify  the  notation,  we  introduce 


x  :  = 


Ai, . . . ,  An 


mxn 


,  u  :  = 


e  Mn+m, 


W 


where 


Throughout  this  chapter,  we  make  the  following  standard  assumptions. 

Assumption  5.1.  Functions  fi  :  Mni  — y  (— oo,  +oo]  {i  =  1,2,  ...,N)  are  closed 
proper  convex. 

Assumption  5.2.  There  exists  a  saddle  point  u*  =  (x^,  x^, . . . ,  x^r,  A*)  to  the  problem 
(5.1).  Namely,  u*  satisfies  the  KKT  conditions: 


(5.15) 


N 


i= 1 


The  optimality  conditions  (5.15)  and  (5.16)  can  be  written  in  a  more  compact  form 


by  the  following  variational  inequality  [33]: 


/(x)  -  /(x*)  +  (u  -  u*)tF(u*)  >  0,  Vu, 


(5.17) 


where  /(x)  :=  ^t/i(x*)  and 


( -AJ\\ 


u): 


AjfX 

^Ax  —  cj 
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5.2  On  the  Convergence  of  Jacobi  ADMM 

In  this  section,  we  provide  a  sufficient  condition  to  guarantee  the  convergence  of  Ja¬ 
cobi  ADMM  (Algorithm  5).  The  condition  only  depends  on  the  coefficient  matrices 
Ai,  without  imposing  further  assumptions  on  the  objective  functions  /*  or  the  penalty 
parameter  /3.  For  the  Gauss-Seidel  ADMM  (Algorithm  4),  a  sufficient  condition  for 
convergence  is  given  in  [9]  for  the  special  case  N  =  3,  assuming  two  of  the  three 
coefficient  matrices  are  orthogonal.  Our  condition  does  not  require  exact  orthogonal¬ 
ity.  Instead,  we  mainly  assume  that  the  matrices  Ai:  i  =  1,2,...,  iV  are  mutually 
“near-orthogonal”  and  have  full  column-rank. 

Theorem  5.1.  Suppose  that  there  exists  S  >=  0  such  that 

||A7Aj||  <6,  V  j,  and  A  min{Aj  Af)  >  3  (N  -  1)J,  V  i,  (5.18) 

where  A min(Aj  Ai)  denotes  the  smallest  eigenvalue  of  Aj Ai.  Then  the  sequence  {ufc} 
generated  by  Algorithm  5  converges  to  a  solution  u*  to  the  problem  (5.1). 

The  proof  technique  is  motivated  by  the  contraction  analysis  of  the  sequence  {ufc} 
under  some  G-norrn  (e.g.,  see  [13,33,36]).  To  prove  the  theorem,  we  first  need  the 
following  lemma: 


Lemma  5.1.  Let 

^AJA, 

\ 

fpAjA1 

AJ ' 

Go 

,  50:  = 

PAjfA^ 

An 

V 

IV 

\  • 

An 

IV 

where  I  is  the  identity  matrix  of  size  m  x  m.  For  k  >  1,  we  have 


-  u'fe  -  l|u‘+1  -  u-|||0  >  ||u‘  -  u‘+‘||!0, 


(5.19) 
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where 


u  —  u 


I  So 


=  u  —  u 


I  Go 


+  2(Afc  —  \k+1)TAfx.k  —  xfc+1). 


(5.20) 


This  lemma  follows  directly  from  Lemma  5.2  (which  will  be  proved  in  the  next 
section),  since  it  is  a  special  case  with  7  =  1  and  P*  =  0,  Vi.  Now  we  are  ready  to 
prove  the  theorem. 

Proof  of  Theorem  5.1.  By  the  assumption  UTl^TLjll  <5,  i  7^  j,  we  have 


Ajbj) 


<  <  i(N- l)(l|a||2  +  ||b||2),  V  a,  b.  (5.21) 


To  simplify  the  notation,  we  let 


:=  xf  -  x*,  i  =  1,2, ...  ,N. 


(5.22) 


Note  that 


\k  _  Xk+ 1 


PAi 


xk-\- 1 


_  xfc+i  =  afc  _  a*+i. 


Then,  we  can  rewrite  (5.20)  as 


ufc  -  ufc+1|||o  =  ^2  ||A(af  -  af+1)||2  +  ||v4afc+1 1|2  +  2(Aak+l ,  A(ak  -  ak+1)) 

i 

(5.23) 

=  T  ll-4.®?!!2  +  2  X)<Aaf+1,  ^a‘)  -  X>.a?+1,  Aj a*+‘)  (5.24) 
>  ^  ||  Aa- 1|2  -  (IV  -  l)h(||afc+1||2  +  ||afc||2)  -  (N  -  l)h||afc+1||2 

i 

(5.25) 

=  || Atak ||2  -  (.V  -  l)<5||afc||2  -  2 (N  -  l)5||afc+1||2,  (5.26) 

i 
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where  the  inequality  (5.25)  comes  from  (5.21).  By  Lemma  5.1,  we  have 
l|ufc-ul2Go~2(7V-  l)5/3||afc||2 

>||ufc+1  -  u*||^o  -  2(N  -  l)5/3||afc+1||2  +  [5  ^  ||/Uf  ||2  -  3(iV  -  l)5/3||afc||2.  (5.27) 

l 

We  further  simplify  (5.27)  as 

bk  -  bk+1  >  dk,  (5.28) 

where  the  sequences  {bk}  and  {dk}  are  defined  by 

bk  :=  ||ufc  -  u* || Go  -  2 (N  -  1 ) <5/5 1 1 afc 1 1 2 ,  (5.29) 

dk  ■=  ||7Uf||2  -  3(iV  -  l)<5/5||afc||2.  (5.30) 

i 

By  the  definition  of  Go,  we  have 

bk  =  HAafll2  -  2  (N  -  l)<S/?||a‘||2  +  l||Al  -  A*||2.  (5.31) 

Since  we  assume  Amm (Aj A,:)  >  3 (N  —  1)5,  it  follows  that 

ll^'H2  >3(iV-l)5||af ||2,  Vi.  (5.32) 


Then  it  is  easy  to  see  that  bk  >  0  and  dk  >  0.  By  (5.28),  the  nonnegative  sequence 
{bk}  is  monotonically  non-increasing.  Hence,  {bk}  converges  to  some  b*  >  0.  By 
(5.28),  it  also  follows  that  dk  — >  0.  Therefore,  ak  — >  0,  i.e.,  xfc  — >•  x*. 

Next  we  show  Xk  — >■  A*.  By  taking  limit  of  (5.31)  and  using  ak  — s-  0,  we  have 

1 


b*  =  lim  bk  =  lim  ^  || A*  —  A*||2. 

k — 5oo  /c— 5oo  p 


(5.33) 


To  show  Afc  — ^  A*,  it  thus  suffices  to  show  b*  =  0. 

By  (5.33),  {Afc}  is  bounded  and  must  have  a  convergent  subsequence  Xkj  — *  A. 
Recall  the  optimality  conditions  for  the  x, -subproblems  (5.14): 


A1  i  A*  -  P(A^  + 


i¥=i  / 


(5.34) 
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By  Theorem  24.4  of  [52],  taking  limit  over  the  subsequence  {kj}  on  both  sides  of 
(5.34)  yields: 

AjX  e  dfi(x*),  Vi.  (5.35) 

Therefore,  (x*,  A)  satishes  the  KKT  conditions  of  the  problem  (5.1).  Since  (x*,  A*)  is 
any  KKT  point,  now  we  let  A*  =  A.  By  (5.33)  and  ||Afcj  —  A*  || 2  — >■  0,  we  must  have 
b*  =  0,  thereby  completing  the  proof.  □ 

5.3  Jacobi-Proximal  ADMM 

In  this  section,  we  propose  the  Jacobi- Proximal  ADMM  (Algorithm  6).  Compared 
with  Algorithm  5,  we  introduce  a  proximal  term  1 1 1  x,;  —  xf||p.  (P*  >y  0)  for  each 
Xj-subproblem  and  a  damping  parameter  7  >  0  for  the  update  of  A. 


Algorithm  6:  Jacobi-Proximal  ADMM 

1  Initialize:  x°  ( i  =  1,2,. 

. . ,  N)  and  A0; 

2  for  k  —  0, 1, . . .  do 

3 

Update  Xj  for  i  —  1 

. . . ,  N  in  parallel  by: 

x,fc+1  =  arg  minx.  f,( 

x,:)  +  f 

Ai x*  +  Ej# Aixj  ~C~T 

2  ,,  9 

+  2  Ilxi  llpp 

4 

Update  Afe+1  =  Xk  - 

-70(E£.4*?+1-c). 

The  proposed  algorithm  has  a  few  advantages.  First  of  all,  as  we  will  show,  it 
enjoys  global  convergence  as  well  as  an  o{l/k)  convergence  rate  under  conditions  on  Pi 
and  7.  Secondly,  when  the  Xj-subproblem  is  not  strictly  convex,  adding  the  proximal 
term  can  make  the  subproblem  strictly  or  strongly  convex,  making  it  more  stable. 
Thirdly,  we  provide  multiple  choices  for  matrices  P*  with  which  the  subproblems  can 
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be  made  easier  to  solve.  For  example,  in  many  cases  that  Aj Ai  is  ill-conditioned 
or  computationally  expensive  to  invert,  the  x,-subproblem  becomes  difficult  to  solve 
exactly  which  contains  a  quadratic  term  f  XjA^kljXj.  One  can  let  P*  =  Dt  —  (3AJ Ai: 
which  effectively  replaces  the  quadratic  term  ,'x,/l,  /l,x,  by  ],x,I)jX;.  The  matrix  Di 
can  be  chosen  as  some  well-conditioned  and  simple  matrix  (such  as  identity  matrix, 
diagonal  matrix  and  so  on),  thereby  leading  to  an  easier  subproblem.  Various  choices 
of  p  have  been  discussed  in  Section  3.2.1.  The  use  of  flexible  proximal  terms  makes 
it  possible  to  solve  its  subproblems  in  different  ways,  important  for  easy  coding  and 
fast  computation. 

In  this  section,  we  mainly  study  the  convergence  of  the  Jacobi-Proximal  ADMM. 
We  first  show  its  convergence  and  then  establish  an  o(l/k)  convergence  rate  in  the 
same  sense  as  in  [35].  Furthermore,  we  discuss  how  to  tune  the  parameters  in  order  to 
make  the  algorithm  more  practical.  Our  numerical  results  on  the  exchange  problem 
and  the  basis  pursuit  problem  show  that  the  proposed  algorithm  achieves  competitive 
performance,  in  comparison  with  several  existing  parallel  algorithms. 


5.3.1  Convergence 

To  simplify  the  notation,  we  let 
(P1+PAJA1 

Gx  := 

V 


\ 


/ 


,  G  \— 


a 


\ 


1  T 

V  'TpJ 


Pn  +  /3AJjAn  j 
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and 


S  := 


Pi  +  fiAfAi 


±Aj\ 

7  1 


Pn  +  (3AJjAn  A Ajf 


(5.36) 


V  •••  ^  pi 

where  I  is  the  identity  matrix  of  size  m  x  m.  In  the  rest  of  the  section,  we  let  {ufc} 
denote  the  sequence  generated  by  Jacobi-Proximal  ADMM  from  any  initial  point. 
The  analysis  is  based  on  bounding  the  error  ||ufc  —  u*||g  and  estimating  its  decrease, 
motivated  by  the  works  [13,33,36]. 


Lemma  5.2.  For  k  >  1,  we  have 


u  —  u 


,  *  1 1  2 
I G 


ut+1  -  u*||J  >  ||u*  -  ufc+1|ll, 


(5.37) 


where 


u  —  u 


ft+l||2  _  ||„fc  fe+1  II 2 


IS 


=  x"  -  X" 


lGa 


+ 


2-7 

h2 


||Afc  -  Afc+1||2  +  ~{\k  -  A k+1)TA(x.k  -  xfc+1). 


7 


(5.38) 


Proof.  Recall  that  in  Algorithm  6,  we  solve  the  following  xrsubprob lerri : 


7*  I  1  .  n  f  \  ^ 

X.  1  =  argmm/Ax,;)  +  - 

XZj 

I 


A^  +  J2 


\k 

J 


+  2  HX*  _  Xi1IPj- 


Its  optimality  condition  is  given  by 
Aj  (v  - /3(A:X?+1  +  £  .4,x* 

V  j^i 


c)  I  +  P,:(xf  -  x^1)  G  <9/)(xf+1).  (5.39) 


fc+i'i 


For  convenience,  we  introduce  A  :=  Xk  —  /3(Axfc+1  —  c).  Then  (5.39)  can  be  rewritten 
as 

+  Pitf  ~  x4fc+1)  G  dffxk+1).  (5.40) 


AJ  (a-/3^A(x‘-x‘+1)) 

V  jA  / 
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By  Lemma  2.1,  it  follows  from  (5.15)  and  (5.40)  that 


^(x‘+1  -  X*),  A  -  A*  -  ft  £  —  x‘ 

,?W 


+  (xr-x*)'^-xr)>o. 

Summing  the  above  inequality  over  all  i  and  using  the  following  equality  for  each  i: 


E  A<(x‘  -  x‘+1)  =  A{xk  -  x‘+1)  -  A(x‘  -  x^1), 

we  obtain 

N 

(A(x-k+1  -  x*).  A  -  A*)  +  E(x?+1  -  x*)T(C  +  l3AjAi)(xk  -  xk+1) 


Z=1 


(5.41) 


>  p(A(-x.k+1  -  x*),  7L(xfc  -  xfc+1)>. 
Note  that 


7L(xfc+1-x*)  =  —  (Afc- Afc+1), 


and 


A  -  A*  =  (A  -  Afc+1)  +  (Afc+1  -  A*)  =  ^ — -(Afc  -  Afc+1)  +  (Afc+1  -  A*). 

7 

With  the  above  two  equations,  the  inequality  (5.41)  can  be  rewritten  as 
1  N 

<-g(A‘  -  At+1),  A‘+1  -  A*)  +  E(x?+1  -  x*)T(P.  +  -  x‘+1) 


1=1 


1-7, 


>  ^r-^llA*  -  Afc+1||2  +  -(Afc  -  Afc+1)TAl(x 
TP  7 

or  more  compactly, 


(5.42) 


k  xfc+1) 


(ufe-uH1)TG(uw-u*)  >  ^||A*-A*+1||2  +  -(A*-Afc+1)T4(x*-x*+1).  (5.43) 

TP  7 

Since  ||ufc  -  u*||^  -  ||ufc+1  -  u*||^  =  2(ufc  -  uk+1)T  G(uk+1  -  u*)  +  ||ufe  -  uk+1\\2G,  using 


the  above  inequality  (5.43)  yields  (5.37)  immediately. 


□ 
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If  the  matrix  S  (5.36)  is  positive  definite,  there  exists  some  rj  >  0  such  that 

||ufc  -  ufc+1|||  >  ^  .  ||ufc  _  ufc+i||2  >  o.  (5.44) 

Then  Lemma  5.2  indicates  that 

||ufc  -  u*\\l  -  ||ufc+1  -  u*\\2g  >  7]  ■  ||ufe  -  ufc+1||2.  (5.45) 

That  is,  the  iterative  sequence  {ufc}  is  strictly  contractive.  In  particular,  the  error 
||uA"  —  u*||^.  is  monotonically  non-increasing  and  thus  converging,  as  well  as  ||  ufc  — 
ufc+i||2  _j,  q  Then  the  convergence  of  the  algorithm  (||ufc  «  u*||g,  — >  0)  follows 
immediately  from  the  standard  analysis  for  contraction  methods  (see,  e.g.,  [31]).  We 
omit  the  details  of  the  proof  for  the  sake  of  brevity  and  state  the  convergence  theorem 
as  follows. 

Theorem  5.2.  Suppose  the  parameters  in  Algorithm  6  satisfy  that  the  matrix  S  (5.36) 
is  positive  definite.  Then  the  sequence  {ufc}  generated  by  Algorithm  6  converges  to  a 
solution  U*  to  the  problem  (5.1). 

In  the  following  theorem,  we  give  a  sufficient  condition  to  guarantee  that  S  is 
positive  definite.  It  basically  requires  the  parameters  P*  ( i  =  1,2,...,  iV)  to  be 
sufficiently  large. 

Theorem  5.3.  Suppose  the  parameters  /3,  7  and  Pi  ( i  =  1,2,  ...,1V)  satisfy  the 
following  condition: 

=  (546) 

1  E£i«<2-7, 

for  some  e*  >  0,  i  =  1,  2, . . . ,  N.  Then  the  matrix  S  (5.36)  is  positive  definite.  Thus, 
the  sequence  {ufc}  generated  by  Algorithm  6  converges  to  a  solution  11*  to  the  problem 
(5.1). 
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The  condition  (5.46)  can  be  reduced  to 

ATA,  i  =  1,2,...,  N,  (5.47) 

by  letting  each  e*  <  ^fL.  In  particular,  for  the  following  choices: 

•  Pi  =  Tjl  (standard  proximal),  condition  (5.47)  becomes  Ti  >  (3  (2^7  —  ||Ab||2;' 

•  Pi  =  Tjl  —  (3AJ Ai  (prox-linear) ,  condition  (5.47)  becomes  Ti  >  |^||ALj||2. 


Proof.  For  any  u  =  (x;  A)  G  M”"1"™,  we  have 


1  ii2  11  1 1 2  .  2  7 

mis  ■=  IMk  + 


h2 


Using  the  following  basic  inequality: 

N 


N 


ATAx  =  —  AT7LjXj  >  — 


7 


z=l 


7 


2  +  -A  TAx. 

7 


2  ,  P 


\Pi 


for  any  e*  >  0  (i  —  1,  2, . . . ,  N),  we  have 

N 


H — 1|  Ai x,j  ||‘ 


2-7-Eii* 


lulls  >  5^  llx*l12 


+ 


2=1 


(5.48) 


(5.49) 


(5.50) 


1  Pi  +/3AJ  Ai—f-^Aj  Ai  1 

The  condition  (5.46)  guarantees  that  Pi+/3Aj Ai  —  ^Aj Ai  >-  0  and  2  — 7— et  >  0, 
and  thus  ||u||s  >  0.  Therefore,  S  is  positive  definite.  The  rest  follows  immediately. 


□ 


Under  the  similar  near-orthogonality  assumption  on  the  matrices  Ai,  we  have  the 
following  convergence  result  for  Jacobi-Proximal  ADMM: 

Theorem  5.4.  Suppose  there  exists  5  >  0  such  that  ||Al7A)||  <  5  for  all  i  7^  j,  and 
the  parameters  in  Algorithm  6  satisfy  the  following  condition:  for  some  a,  p  >  0  and 
0  <  7  <  2, 

f  P,  >-  >9(i  -  mjA,  +  &S(N  -  1)1  . 

<  for  1  —  1, . . . ,  N. 

{  \^(AJ A)  >  I^S(N  -  1) 

Then  Algorithm  6  converges  to  a  solution  to  the  problem  (5.1). 


(5.51) 
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Proof.  Let 


/ 


H  := 


AJA1 


V 


\ 


A] yAN  J 


if  IKTA||  <i  for  all  i  7^  j,  then  it  is  easy  to  show  the  following:  for  any  x  and  y, 


N 


N 


Px||2  =  ll^xif  +  ^XiAjAjXj  >  IIAxill2  llx*H llxi 


i=  1 
N 


i= 1 


>  \\Ai*i\\2  -  ~  l)||x||2  = 


and 


i=  1 


N 


2x  /1T /ly  =  2  Aj Ajyj  +  2  ^  xT A,-yi 


(5.52) 


(5.53) 


2=1 

TV 


>  2  5^  x,'  Aj A]Yj  -  26  l|x.t 


2=1 

N 

>  -J2a\\Ai 


*¥=3 


N 


2=1 


p6(N  -  l)||x||2  -  ^  l||A,y,f  -  U(N  -  l)||y||' 

1=1  a  P 


=  -  X 


[aH+p5{N- 1)1]  l|y  II  1)1]  ’  ^ai  P  >  0) 


(5.54) 


Using  the  above  inequalities,  we  have 

~(Xk  -  A k+1)TA(x.k  -  xfc+1)  =  2/3(x.k+1  -  x.*)ATA{x.k  -  xfc+1) 


7 


^fc+1  v*||2 

I  [a.ff+p<5(lV-l)I] 


/3||xfc-xfc+1||fiH+i6(JV_1)I], 


(5.55) 


and 


\\Xk-Xk 

Therefore, 


=  7  2/32P(x* 


—  72/32||xfc+1  —  X*||^_|5(Ar_ 


1)1]  • 


(5.56) 


|ufc-ufc+1"2 


s  >llx*  -  **Tg.  +  (2  -  7)/?||xfc+i  -  x*||fH_5(JV_1)I] 

H+pS(N- 1)1]  —  /5||x  '  —  X  +  ||[Iff+i(5(jv_i)I]- 


|xfc+1  -x*"2 


(5.57) 
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As  long  as  the  following  holds: 

|  Gx  y  £h+£6(n  -1)1, 

\  a  p  (5.58) 

[  (2  -  -  5{N  -  1)1]  >-  p[aH  +  P5(N  -  1)1], 

which  is  equivalent  to  the  condition  (5.51),  there  must  exist  some  r/  >  0  such  that 
(5.44)  and  (5.45)  hold.  Then  the  convergence  of  Algorithm  6  follows  immediately 
from  the  standard  analysis  of  contraction  methods  [31].  □ 

Remark  5.1.  The  conditions  in  Theorem  5.4  guarantee  that  (5.44)  holds,  i.e.,  there 
exists  some  rj  >  0  such  that 

||ufe  -  ufe+1|||  >  77  •  ||ufc  —  ufc+1||2. 

2  —  7  +  p 

But  the  matrix  S  is  not  necessarily  positive  semi- definite.  Also,  the  term  - 

2  —  7  —  a 

in  (5.51)  may  be  negative  for  some  a.  Then,  Xmin(Aj Af)  is  allowed  to  be  0,-  in  other 
words,  Ai  may  not  be  of  full  column  rank.  As  long  as  the  conditions  in  (5.51)  are 
satisfied,  Algorithm  6  will  converge. 


5.3.2  Convergence  Rate  of  o(l/k) 


Next,  we  shall  establish  the  o(l/k)  convergence  rate  of  Jacobi- Proximal  ADMM. 
We  use  the  quantity  ||ufc  —  ufc+1|||^  as  a  measure  of  the  convergence  rate  motivated 
by  [33,35].  Here,  we  define  the  matrix  M  by 


M  := 


Mm 


\ 


1  T 

V  "A1/ 


and  Mx  :=  Gx  —  f3AT A. 
Theorem  5.5.  If  S  >-  0  arid  Mx  >z  0,  then 


u  —  u 


I M 


=  o(l /k), 
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and,  thus, 

||xfc  -  xfe+1||^  =  o(l/k)  and  ||Afc  -  Afc+1||2  =  o{l/k). 

We  need  the  following  monotonic  property  of  the  iterations: 

Lemma  5.3.  If  Mx  >z  0  and  0  <  7  <  2,  then 

||ufc-ufe+1||^<  ||ufe-1-ufc||2M.  (5.59) 

Proof.  Let  Ax.k+1  =  xf  -  x.k+1,i  =  1  , . . . ,  A,  Axfc+1  =  xfc  -  xfc+1,  and  AAfc+1  = 
Xk  —  Xk+1.  By  Lemma  2.1,  the  optimality  conditions  (5.40)  at  k-th  and  ( k  +  l)-th 
iterations  yield 

(AiAx.k+1,  AXk-fiAAx.k+1-f3  ^(AxJ-AxJ+1))+(Axf+1)TPi(Axf^Axf+1)  >  0. 

j¥* 

(5.60) 

Summing  up  over  all  i  and  rearranging  the  terms,  we  have 

(AAxfc+1,  AAfc)  >  ||Axfc+1||^  -  (Axfc)T(Gx  -  (3ATA) Axfc+1.  (5.61) 

Since  Mx  :=  Gx  —  (3AT A  P  0,  we  have 

2(Axl)T(GI  -  PA1  A)Axb+l  <  ||  Ax‘117  +  ||Ax‘fc,,  (5.62) 

and  thus 

2(2tAx‘+1,  AA*)  >  ||Axt+1||^_„,  -  HAx‘111,, 

=  l|Ax‘+1H7+^TJ  -  ||AxtH5J_.  (5.63) 

Note  that  AAfc+1  =  AAfc  —  7/3AAxfc+1.  It  follows  that 

-J*-||AAfc||2  -  ^-||AAfc+1||2  =  2(AAxfc+1,AAfc)  -7/d||AAxfc+1||2 

7  P  IP 

>  ||  Axfc+1||Q3;+(1_/y-)|gATJ4  “  ||  Axfc||^,  (5.64) 
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or  equivalently, 

(IIAx‘111,,  +  -0|AA*||2)  -  (||Ax‘+'||„,  +  -1;||AA*+1||2) 

^  A/?  (5.65) 

>  l|Ax‘+1||  (2-7  )/3ArA  —  0) 

which  completes  the  proof.  jg| 

Proof  of  Theorem  5.5.  By  Lemma  5.2  and  S  >~  0,  there  must  exist  //  >  0  such 
that 


u 


u 


\h 


u 


fe+1 


u 


Ig> 


ii- 


ii 


fc+l||2 


I G  — 


> 


U 


u 


fe+1 1|  2 
I  M‘ 


(5.66) 


Summing  (5.66)  over  k  gives 

OO 

^  ||ufc  —  ufc+1||^- <  oo.  (5.67) 

k= 1 

On  the  other  hand,  Lemma  5.3  implies  the  monotone  non- increasing  of  ||ufc  — ufc+1|||^. 
By  Lemma  3.2,  we  have  ||ufc  —  ufc+1||^  =  o(l/fc),  which  completes  the  proof.  □ 


5.3.3  Adaptive  Parameter  Tuning 

The  parameters  satisfying  the  condition  (5.46)  may  be  rather  conservative,  because 
the  inequality  (5.49)  for  bounding  ||u|||  is  usually  very  loose.  In  practice,  we  can 
compute  ||ufc  —  ufc+1|||  exactly  at  very  little  extra  cost.  If  ||ufc  —  ufc+1|||  >  0,  then 
Lemma  5.2  assures  the  decreasing  of  the  solution  error  (in  the  G-norm)  so  that  the 
current  parameters  are  acceptable.  On  the  other  hand,  if  ||ufc  —  ufc+1|||  <  0,  then 
the  matrix  S  is  not  positive  definite,  suggesting  that  the  current  parameters  Pt,  i  = 
1,2, . . . ,  N  may  be  too  small.  So  we  should  make  {Pi}  bigger  until  ||ufc  —  ufc+1|||  > 
0  holds.  Therefore,  we  propose  a  practical  strategy  for  adaptively  adjusting  the 
parameters  {Pi}  based  on  the  value  of  ||ufc  —  ufc+1|||: 
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1  Initialize  with  small  Pf  >z  0  (i  —  1,  2, . . . ,  N)  and  a  small  rj  >  0; 

2  for  k  =  1,  2, ...  do 
if  ||ufc_1  —  ufc|||  >  V  '  || —  ufc||2  then 

P-+ 1  <-  Pi f,  V*; 

else 

Increase  Pf  pf+1  4-  otiPk  +  faQi  (a*  >  1,  A  >  0,  Q*  >-  0),  Vi; 
Restart:  uk  4—  ufc_1; 


The  above  strategy  starts  with  relatively  small  proximal  parameters  {Pi}  and 
gradually  increase  them.  By  Theorem  5.3,  we  know  that  when  the  parameters  {-P*} 
are  large  enough  for  (5.46)  to  hold,  the  condition  (5.44)  will  be  satisfied  (for  sufficiently 
small  77).  Therefore,  the  adjustment  of  {-P?:}  cannot  occur  infinite  times.  After  a  finite 
number  of  iterations,  {Pi}  will  remain  constant  and  the  contraction  property  (5.45) 
of  the  iterations  will  hold.  Therefore,  the  convergence  of  such  an  adaptive  parameter 
tuning  scheme  follows  immediately  from  our  previous  analysis. 

Theorem  5.6.  Suppose  the  matrices  Pi  [i  =  1,2,...,  N )  in  Algorithm  6  are  adap¬ 
tively  adjusted  using  the  above  scheme.  Then  the  algorithm  converges  to  a  solution  to 
the  problem  (5.1). 

Empirical  evidence  shows  that  the  parameters  {Pi}  typically  adjust  themselves 
only  during  the  first  few  iterations  and  then  remain  constant  afterwards.  Alterna¬ 
tively,  one  may  also  decrease  the  parameters  after  every  few  iterations  or  after  they 
have  not  been  updated  for  a  certain  number  of  iterations.  But  the  total  times  of  de¬ 
crease  should  be  bounded  to  guarantee  convergence.  By  using  this  adaptive  strategy, 
the  resulting  parameters  {Pi}  are  usually  much  smaller  than  those  required  by  the 
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condition  (5.46),  thereby  leading  to  substantially  faster  convergence  in  practice. 

5.4  Numerical  Experiments 

In  this  section,  we  present  numerical  results  to  compare  the  performance  of  the  fol¬ 
lowing  parallel  splitting  algorithms: 

•  Prox-JADMM:  proposed  Jacobi-Proximal  ADMM  (Algorithm  6); 

•  VS  ADMM:  Variable  Splitting  ADMM  (Algorithm  3); 

•  Corr-JADMM:  Jacobi  ADMM  with  correction  steps  [33].  At  every  iteration, 
it  first  generates  a  “predictor”  ufc+1  by  an  iteration  of  Jacobi  ADMM  (Algorithm 
5)  and  then  corrects  ufc+1  to  generate  the  new  iterate  by: 

uk+1  =  uk -ak(uk -nk+1),  (5.68) 

where  ak  >  0  is  a  step  size.  In  our  experiments,  we  adopt  the  dynamically 
updated  step  size  ak  according  to  [33],  which  is  shown  to  converge  significantly 
faster  than  using  a  constant  step  size,  though  updating  the  step  size  requires 
extra  computation. 

•  YALL1:  one  of  the  state-of-the-art  solvers  for  the  Pi-minimization  problem. 

In  Section  5.4.1  and  5.4.2,  all  of  the  numerical  experiments  are  run  in  MATLAB 
(R2011b)  on  a  workstation  with  an  Intel  Core  i5-3570  CPUs  (3.40GHz)  and  32  GB 
of  RAM.  Section  5.4.3  gives  two  very  large  instances  that  are  solved  by  a  C/MPI 
implementation  on  Amazon  Elastic  Compute  Cloud  (EC2). 
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5.4.1  Exchange  Problem 

Consider  a  network  of  N  agents  that  exchange  n  commodities.  Let  x*  G  MJ1  (i  = 
1,2,  ...,N)  denote  the  amount  of  commodities  that  are  exchanged  among  the  N 
agents.  Each  agent  i  has  a  certain  cost  function  /*  :  Mn  — >•  M.  The  exchange  problem 
(see,  e.g.,  [4]  for  a  review)  is  given  by 


N 


N 


(5.69) 


which  minimizes  the  total  cost  among  N  agents  subject  to  an  equilibrium  constraint 
on  the  commodities.  This  is  a  special  case  of  (5.1)  where  —  1  and  c  =  0. 

We  consider  quadratic  cost  functions  /j(xj)  :=  |||C)xj  —  dj||2,  where  Ct  G  Mpxn  and 
di  G  Mp.  Then  all  the  compared  algorithms  solve  the  following  type  of  subproblems 
at  every  iteration: 


xf+1  —  argmin  (||CjX,;  -  di\\2  +  l||x,:  -  6f||2,  Vi  =  1, 2, . . . ,  N, 


(5.70) 


except  that  Prox-JADMM  also  adds  a  proximal  term  4 1 1 x,-  —  x^||p  .  Here  b\  G  Mm 
is  a  vector  independent  of  x*  and  takes  different  forms  in  different  algorithms.  For 
Prox-JADMM,  we  simply  set  P,  =  r,I  (r*  >  0).  Clearly,  each  Xj-subproblem  is  a 
quadratic  program  that  can  be  computed  efficiently  using  various  methods. 

In  our  experiment,  we  randomly  generate  x*,  i  —  1,2, . . . ,  N  —  1,  following  the 
standard  Gaussian  distribution,  and  let  x.*N  =  —  ^^x*.  Matrices  Ci  are  random 
Gaussian  matrices,  and  vectors  di  are  computed  by  di  =  GjX*.  Apparently,  x*  is  a 
solution  (not  necessarily  unique)  to  (5.69),  and  the  optimal  objective  value  is  0. 


The  penalty  parameter  /3  is  set  to  be  0.01,  1  and  0.01  for  Prox-JADMM,  VSADMM 
and  Corr-JADMM,  respectively.  They  are  nearly  optimal  for  each  algorithm,  picked 
out  of  a  number  of  different  values.  Note  that  the  parameter  for  VSADMM  is  quite 


different  from  the  other  two  algorithms  because  it  has  different  constraints  due  to 
the  variable  splitting.  For  Prox-JADMM,  the  proximal  parameters  are  initialized  by 
Tj  =  0.1(iV  —  1  )/3  and  adaptively  updated  by  the  strategy  in  Subsection  5.3.3;  the 
parameter  7  is  set  to  be  1. 

The  size  of  the  test  problem  is  set  to  be  n  =  100,  N  =  100,  p  =  80.  Letting 
all  the  algorithms  run  200  iterations,  we  plot  their  objective  value  /*(x*)  and 
residual  ||  ^^1Xj||2.  Note  that  the  per-iteration  cost  (in  terms  of  both  computation 
and  communication)  is  roughly  the  same  for  all  the  compared  algorithms.  Figure  5.1 
shows  the  comparison  result,  which  is  averaged  over  100  random  trials.  We  can  see 
that  Prox-JADMM  is  clearly  the  fastest  one  among  the  compared  algorithm. 


Figure  5.1  :  Exchange  problem  (n 


100,  N  =  100,  p  —  80). 
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5.4.2  Basis  Pursuit 

We  consider  the  -minimization  problem  for  finding  sparse  solutions  of  an  underde- 
termined  linear  system: 

min  ||x||i  s.t.  Ax  =  c,  (5-71) 

X 

where  x  G  Mn,  A  G  Mmxn  and  c  G  Mm  (m  <  n).  ft  is  also  known  as  the  basis 
pursuit  problem,  which  has  been  widely  used  in  compressive  sensing,  signal  and  image 
processing,  statistics,  and  machine  learning.  Suppose  that  the  data  is  partitioned  into 
N  blocks:  x  =  [x1;  x2, . . . ,  xtv]  and  A  =  [Ai,  A2, . . . ,  AN}.  Then  the  problem  (5.71) 
can  be  written  in  the  form  of  (5.1)  with  /j(xj)  =  ||xj||i. 

In  our  experiment,  a  sparse  solution  x*  is  randomly  generated  with  k  (k  • C  n) 
nonzeros  drawn  from  the  standard  Gaussian  distribution.  Matrix  A  is  also  randomly 
generated  from  the  standard  Gaussian  distribution,  and  it  is  partitioned  evenly  into 
N  blocks.  The  vector  c  is  then  computed  by  c  =  Ax*  +  r/,  where  ij  ~  A7(0,  <rJI)  is 
Gaussian  noise  with  standard  deviation  a. 

Prox-JADMM  solves  the  x,;-subproblenrs  with  Pj  =  r,;I  —  /3AJ A,  (i  —  1,2, ,  N) 
as  follows: 

=argmin  ||x^||i  + 

Xi 

=  arg  min  ||xj||i  +  (^Aj  ^Axfe  -  c  -  ^  ,  x^  +  ^  ||x*  -  x^||2  .  (5.72) 

Here,  we  choose  the  prox-linear  P^s  to  linearize  the  original  subproblems,  and  thus 
(5.72)  admits  a  simple  closed-form  solution  by  the  shrinkage  (or  soft-thresholding ) 
formula.  The  proximal  parameters  are  initialized  as  Tj  =  0  AN  ft  and  are  adaptively 
updated  by  the  strategy  discussed  in  Section  5.3.3. 


P 

2 


A,x, 


J2A^~c~ 


xk 

J 


+  2  Hx-  -  x. 
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Recall  that  VSADMM  needs  to  solve  the  following  Xj-subproblems: 

2 


fc+i  •  „  „  P 

=  arg mm  ||x?;||i  +  - 


A  x.  _  2^+! C_ 

*  *  *  N  p 


(5.73) 


Such  subproblems  are  not  easily  computable,  unless  x,  is  a  scalar  (i.e.,  nt  =  1)  or 
is  a  diagonal  matrix.  Instead,  we  solve  the  subproblems  approximately  using 
the  prox-linear  approach: 


x‘+1  =  arg  min  ||x,||i  +  (pAj  h^x?  -  z*+1  -  ,  x.  -  x? ^  ^  |jx.  -  xf||\ 

(5.74) 

which  can  be  easily  computed  by  the  shrinkage  operator.  We  set  =  1.01,d||  Aj||2  in 
order  to  guarantee  the  convergence,  as  suggested  in  [58]. 

Corr-JADMM  solves  the  following  x, -subproblems  in  the  “prediction”  step: 


xf+1  =  arg  min  ||Xj||i  +  ^ 


d,x,  +  y,  Ajx)  -  c  - 


Afc 

J 


(5.75) 


Because  the  correction  step  in  [33]  is  based  on  exact  minimization  of  the  subproblems, 
we  do  not  apply  the  prox-linear  approach  to  solve  the  subproblcms  approximately. 
Instead,  we  always  partition  x  into  scalar  components  (i.e.,  N  =  n)  so  that  the 
subproblems  (5.75)  can  still  be  computed  exactly.  The  same  penalty  parameter  /3  = 
10/ ||c|| i  is  used  for  the  three  algorithms.  It  is  nearly  optimal  for  each  algorithm, 
selected  out  of  a  number  of  different  values. 

In  addition,  we  also  include  the  YALL1  package  [61]  in  the  experiment,  which 
is  one  of  the  state-of-the-art  solvers  for  i\  minimization.  Though  YALL1  is  not 
implemented  in  parallel,  the  major  computation  of  its  iteration  is  matrix-vector  mul¬ 
tiplication  by  A  and  AT ,  which  can  be  easily  parallelized  (see  [48]).  Since  all  the 
compared  algorithms  have  roughly  the  same  amount  of  per-iteration  cost  (in  terms 
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(a)  Noise-free  (a  =  0)  (b)  Noise  added  (a  =  10  3) 

Figure  5.2  :  Basis  pursuit  (n  =  1000,  m  =  300,  k  =  60). 


of  both  computation  and  communication),  we  simply  let  all  the  algorithms  run  for  a 


fixed  number  of  iterations  and  plot  their  relative  error 


||x  -x  lb 

||x*  1 1 2 


Figure  5.2  shows  the  comparison  result  where  n  =  1000,  m  =  300,  k  =  60  and 
the  standard  deviation  of  noise  cr  is  set  to  be  0  and  10~3,  respectively.  For  Prox- 
JADMM  and  VSADMM,  we  set  N  =  100;  for  Corr-JADMM,  we  set  N  =  1000.  The 
results  are  average  of  100  random  trials.  We  can  see  that  Prox-JADMM  and  Corr- 
JADMM  achieve  very  close  performance  and  are  the  fastest  ones  among  the  compared 
algorithms.  YALL1  also  shows  competitive  performance.  However,  VSADMM  is  far 
slower  than  the  others,  probably  due  to  inexact  minimization  of  the  subproblems  and 
the  conservative  proximal  parameters. 


5.4.3  Distributed  Large-Scale  Basis  Pursuit 

In  previous  subsections,  we  described  the  numerical  simulation  of  a  distributed  im¬ 
plementation  of  Jacobi- Proximal  ADMM  that  was  carried  out  in  Matlab.  We  now 
turn  to  realistic  distributed  examples  and  solve  two  very  large  instances  of  the  l\- 
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minimization  problem  (5.71)  using  a  C  code  with  MPI  for  inter-process  communica¬ 
tion  and  the  GNU  Scientific  Library  (GSL)  for  BLAS  operations.  The  experiments 
are  carried  out  on  Amazon’s  Elastic  Compute  Cloud  (EC2). 

We  generate  two  test  instances  as  shown  in  Table  5.1.  Specifically,  a  sparse  so¬ 
lution  x*  is  randomly  generated  with  k  nonzeros  drawn  from  the  standard  Gaussian 
distribution.  Matrix  A  is  also  randomly  generated  from  the  standard  Gaussian  distri¬ 
bution  with  m  rows  and  n  columns,  and  it  is  partitioned  evenly  into  IV  =  80  blocks. 
Vector  c  is  then  computed  by  c  =  Ax*.  Note  that  A  is  dense  and  has  double  precision. 
For  Test  1  it  requires  over  150  GB  of  RAM  and  has  20  billion  nonzero  entries,  and  for 
Test  2  it  requires  over  337GB  of  RAM.  Those  two  tests  are  far  too  large  to  process 
on  a  single  PC  or  workstation.  We  want  to  point  out  that  we  cannot  find  a  dataset  of 
similar  or  larger  size  in  the  public  domain.  We  are  willing  to  test  our  a  larger  problem 
per  reader’s  request. 


Table  5.1  :  Two  large  datasets 


m 

n 

k 

RAM 

dataset  1 

1.0  x  105 

2.0  x  105 

2.0  x  103 

150GB 

dataset  2 

1.5  x  105 

3.0  x  105 

3.0  x  103 

337GB 

We  solve  the  problem  using  a  cluster  of  10  machines,  where  each  machine  is  a 
“memory-optimized  instance”  with  68  GB  RAM  and  1  eight-core  Intel  Xeon  E5-2665 
CPU.  Those  instances  run  Ubuntu  12.04  and  are  connected  with  10  Gigabit  ethernet 
network.  Since  each  has  8  cores,  we  run  the  code  with  80  processes  so  that  each 
process  runs  on  its  own  core.  Such  a  setup  is  charged  for  under  $17  per  hour. 

We  solve  the  large-scale  t\  minimization  problems  with  a  C  implementation  that 
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matches  the  Matlab  implementation  in  the  previous  section.  The  implementation 
consists  of  a  single  hie  of  C  code  of  about  300  lines,  which  is  available  for  download 
on  our  personal  website. 


Table  5.2  :  Time  results  for  large  scale  basis  pursuit  examples 


150GB  Test 

337GB  Test 

Itr 

Time(s) 

Cost($) 

Itr 

Tirne(s) 

Cost($) 

Data  generation 

- 

44.4 

0.21 

- 

99.5 

0.5 

CPU  per  iteration 

- 

1.32 

- 

- 

2.85 

- 

Comm,  per  iteration 

- 

0.07 

- 

- 

0.15 

- 

Reach  10-1 

23 

30.4 

0.14 

27 

79.08 

0.37 

Reach  10-2 

30 

39.4 

0.18 

39 

113.68 

0.53 

Reach  10~3 

86 

112.7 

0.53 

84 

244.49 

1.15 

Reach  10-4 

234 

307.9 

1.45 

89 

259.24 

1.22 

The  breakdown  of  the  wall-clock  time  is  summarized  in  Table  5.2.  We  can  observe 
that  Jacobi  ADMM  is  very  efficient  in  obtaining  a  relative  low  accuracy,  which  is 
usually  sufficient  for  large-scale  problems.  We  want  to  point  out  that  the  basic  BLAS 
operations  in  our  implantation  can  be  further  improved  by  using  other  libraries  such 
as  hardware-optimized  BLAS  libraries  produced  by  ATLAS,  Armadillo,  etc.  Those 
libraries  might  lead  to  several  times  of  speedup*.  We  use  GSL  due  to  its  ease  of  use, 
so  the  code  can  be  easily  adapted  for  solving  similar  problems. 


http://nghiaho.com/?p=1726 


94 


Chapter  6 
Conclusion 

In  this  thesis,  we  have  mainly  developed  two  different  types  of  generalizations  to  the 
classic  alternating  direction  method  of  multipliers  (ADMM): 

•  A  Generalized  ADMM  with  simplified  subproblems.  The  Generalized 
ADMM  framework  allows  flexible  ways  of  solving  the  subproblems  efficiently.  It 
subsumes  many  variants  of  ADMM  such  as  prox-linear  ADMM,  gradient-descent 
ADMM  and  ADMM  with  Hessian  approximation.  We  establish  the  global  con¬ 
vergence  of  the  Generalized  ADMM  and  improve  an  existing  convergence  rate 
of  0(  1/k)  slightly  to  o(l/k).  Furthermore,  we  establish  its  linear  convergence 
rate  under  various  scenarios,  in  which  at  least  one  of  the  two  objective  functions 
is  strictly  convex  and  has  Lipschitz  continuous  gradient. 

•  A  parallel  and  multi-block  extension  to  ADMM.  We  extend  ADMM  from 
the  classic  Gauss-Seidel  setting  to  the  Jacobi  setting,  from  2  blocks  to  N  blocks. 
The  algorithm  can  be  implemented  in  a  fully  parallel  and  distributed  fashion, 
and  thus  is  particularly  attractive  for  solving  certain  problems  with  huge  and 
distributed  datasets.  However,  the  straightforward  extension  —  Jacobi  ADMM 
is  not  necessarily  convergent.  To  guarantee  its  convergence,  we  provide  a  suffi¬ 
cient  condition  on  the  constraint  coefficient  matrices  A,.  For  general  matrices 
Ai,  we  propose  the  Jacobi- Proximal  ADMM  by  adding  proximal  terms  of  differ¬ 
ent  kinds  to  the  subproblems.  We  show  that  the  algorithm  converges  globally 
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at  a  rate  of  o(l/k),  while  allowing  flexible  and  efficient  ways  of  solving  the 
subproblems.  Numerical  results  are  presented  to  demonstrate  the  efficiency  of 
the  proposed  algorithm  in  comparison  with  several  existing  parallel  algorithms. 
We  also  implemented  our  algorithm  on  Amazon  EC2,  an  on-demand  public 
computing  cloud,  and  report  its  performance  on  very  large-scale  basis  pursuit 
problems  with  distributed  data. 

These  generalizations  are  of  both  theoretical  and  practical  importance  to  the  ap¬ 
plications  of  ADMM.  From  the  theoretical  point  of  view,  the  generalizations  provide 
unified  frameworks  to  analyze  a  wide  variety  of  ADMM-based  algorithms  and  their 
variants  in  practice,  particularly  in  the  areas  of  compressive  sensing,  signal  and  image 
processing,  machine  learning  and  applied  statistics.  Our  convergence  analysis  makes 
various  meaningful  extensions  to  the  existing  convergence  theory,  as  well  as  obtaining 
better  convergence  rates. 

From  the  practical  point  of  view,  the  generalizations  introduce  better  flexibility 
and  efficiency  in  solving  the  subproblems,  thereby  making  the  algorithms  more  scal¬ 
able  for  many  large-scale  problems.  Our  analysis  provides  a  deeper  understanding  of 
the  underlying  convergence  mechanism  of  ADMM.  It  sheds  lights  on  how  to  design 
and  tune  the  algorithms  in  practice.  In  particular,  our  derived  linear  convergence 
rates  provide  insights  on  how  the  penalty  parameter  f3  affects  the  convergence  speed, 
thereby  providing  some  theoretical  guidance  for  choosing  f3.  Also,  the  convergence 
analysis  of  the  Jacobi-Proximal  ADMM  enables  us  to  design  a  way  of  adaptively 
tuning  the  proximal  parameters  to  make  its  convergence  much  faster. 

Besides  the  generalizations  discussed  in  this  thesis,  we  leave  many  other  general¬ 
izations  and  analyses  for  future  study.  We  list  a  few  topics  below: 


•  In  both  of  the  Generalized  ADMM  and  the  Jacobi-Proximal  ADMM  frame- 
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works,  the  proximal  terms  in  the  subproblems  are  restricted  to  be  quadratic 
terms.  The  quadratic  proximal  term  is  perhaps  most  commonly  used  and  is 
often  simple  enough  to  compute.  However,  we  may  make  further  generalization 
by  allowing  non- quadratic  proximal  terms  (such  as  Bregman  divergence).  In 
some  applications,  choosing  certain  non-quadratic  proximal  terms  wisely  may 
make  the  subproblems  simpler  to  solve  and  make  the  algorithm  run  faster  than 
using  quadratic  proximal  terms.  Further  research  is  needed  to  understand  the 
convergence  behaviors  when  using  non-quadratic  proximal  terms  and  how  to 
choose  these  proximal  terms  in  practice. 

•  The  penalty  parameter  f3  and  the  proximal  parameters  P/Q  (in  Generalized 
ADMM)  or  Pi,  i  —  1,  2, . . . ,  N  (in  Jacobi-Proximal  ADMM)  are  important  pa¬ 
rameters  affecting  the  performance  of  the  algorithms.  To  make  the  presentation 
of  our  analysis  simpler  to  understand,  we  have  assumed  these  parameters  to  be 
fixed  over  iterations  for  most  part  of  the  thesis.  Empirical  studies  have  shown 
that  updating  these  parameters  properly  during  the  iterations  can  often  greatly 
improve  the  performance  of  the  algorithm  and  make  it  more  robust  to  the  initial 
values  of  the  parameters.  In  Section  5.3.3,  we  have  briefly  discussed  a  simple 
method  for  adjusting  the  proximal  parameters  of  the  Jacobi-Proximal  ADMM, 
while  the  penalty  parameter  is  assumed  to  be  fixed.  It  remains  open  how  to 
effectively  update  the  penalty  parameter  in  coordination  with  the  adjustment 
of  proximal  parameters.  It  still  needs  further  investigation  and  improvement. 

•  For  the  Jacobi-Proximal  ADMM,  we  have  established  its  o(l/k)  rate  of  conver¬ 
gence  under  fairy  mild  conditions.  Analogous  to  our  linear  convergence  results 
for  the  Generalized  ADMM,  we  should  also  be  able  to  show  the  linear  conver- 
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gence  of  the  Jacobi-Proximal  AD  MM  under  certain  additional  assumptions  such 
as  strong  convexity  and  Lipschitz  continuous  gradients  of  the  objective  func¬ 
tions.  Also,  similar  results  might  be  obtained  for  the  Gauss-Seidel  multi-block 
extension  of  ADMM. 

•  Though  the  classic  ADMM  can  be  considered  as  a  special  case  of  the  Douglas- 
Rachford  Splitting  Method  (DRSM),  it  is  unclear  whether  one  can  establish 
equivalence  between  our  extensions  of  ADMM  (i.e.,  the  Generalized  ADMM 
and  the  Jacobi-Proximal  ADMM)  and  any  operator  splitting  methods  with 
skillfully  chosen  operators 

•  In  addition,  there  is  a  variety  of  interesting  ongoing  development  of  ADMM 
such  as  the  stochastic  ADMM  [45],  online  ADMM  [54,57],  and  asynchronous 
ADMM  [60].  ADMM  seems  to  be  a  powerful  and  promising  algorithmic  tool 
for  many  modern  “Big  Data”  applications  which  will  remain  for  future  research 
and  development. 
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